Abstract. Quantum machine learning has received significant attention in recent years, and promising progress 
has been made in the development of quantum algorithms to speed up traditional machine learning tasks. In this 
work, however, we focus on investigating the information-theoretic upper bounds of sample complexity—how many 
training samples are sufficient to predict the future behaviour of an unknown target function. This kind of problem 
is, arguably, one of the most fundamental problems in statistical learning theory and the bounds for practical settings 
can be completely characterised by a simple measure of complexity. 

Our main result in the paper is that, for learning an unknown quantum measurement, the upper bound, given 
by the fat-shattering dimension, is linearly proportional to the dimension of the underlying Hilbert space. Learning 
an unknown quantum state becomes a dual problem to ours, and as a byproduct, we can recover Aaronson’s famous 
result [Proc. R. Soc. A 463, 3089-3144 (2007)] solely using a classical machine learning technique. In addition, 
other famous complexity measures like covering numbers and Rademacher complexities are derived explicitly. We 
are able to connect measures of sample complexity with various areas in quantum information science, e.g. quantum 
state/measurement tomography, quantum state discrimination and quantum random access codes, which may be 
of independent interest. Lastly, with the assistance of a general Bloch-sphere representation, we show that learning 
quantum measurements/states can be mathematically formulated as a neural network. Consequently, classical ML 
algorithms can be applied to efficiently accomplish the two quantum learning tasks. 


1. Introduction 

Statistical learning theory [1, 2] or Machine Learning (ML) [3] is a branch of artificial intelligence which aims to 
devise algorithms for machines to systematically learn from historic data. Typically, ML has been separated into 
unsupervised learning and supervised learning. In unsupervised learning, the machine is most useful for finding the 
hidden structure, e.g. clustering or density estimation, within unlabeled data. In supervised learning, the machine 
is equipped with more power to predict the class or to infer the characteristic from the structured data. The figures 
of merit for a learning machine include: (i) computational complexity which measures the run-time efficiency of a 
learning algorithm; (ii) sample complexity which determines the number of queries to a membership made by the 
learning algorithm such that the hypothesis function is Probably Approximately Correct (PAC) [4]; and (iii) model 
complexity (otherwise called generalization error [5]) which is defined as the discrepancy between the out-of-sample 
error and the in-sample error. Note that model complexity is closely related to sample complexity in the sense 
that a learning machine with large model complexity requires more samples to accurately approximate the target 
function, which results in high sample complexity. Current research trends include the reduction of computational 
complexity caused by large-volume data sets (big data) as well as by the high-dimensional features of each data point, and 
how to balance model complexity against in-sample error so that the model fits the training data well without 
overfitting. 

Quantum Information Processing (QIP) is an active field that studies the computational capability in quantum 
systems. In recent years, QIP has achieved significant breakthroughs [6]: factoring large integers with 
an exponential speed-up [7] and searching an unstructured database with a quadratic speed-up [8] are the two most 
famous examples. There are two features of QIP that result in dramatic improvement over classical information 
processing: (1) The superposition principle: contrary to the classical bit, which takes the discrete value 0 or 1, 
a quantum bit (or qubit) can be in any linear combination of the two quantum states $|0\rangle$ and $|1\rangle$. The principle is 
a consequence of a fundamental property of quantum mechanics: the linearity of Schrödinger’s wave equation. 
The superposition principle therefore allows the outcomes of parallel quantum computation to be stored in a single 
quantum state, which gives quantum machines more computing ability than classical devices. (2) Entanglement: 
quantum entanglement is the most remarkable phenomenon in quantum theory. This resource plays a crucial role 
in numerous results, including quantum Shannon theory [9-12], quantum error-correcting codes [13-16], and so on. 
These features make QIP a multidisciplinary research area with a broad range of promising applications. 

Owing to the successful achievements of QIP, researchers have begun to explore whether QIP can advance other 
subjects of classical computer science. Consequently, the interdisciplinary area of quantum machine learning [17, 18] 
has attracted substantial interest lately. The central problems are two-fold. The first kind of problem investigates 
how QIP can improve classical ML tasks by converting classical algorithms partially or totally to a quantum 
algorithm. More precisely, one studies how quantum machines can serve to accelerate the ML process to improve 
computational efficiency, or to reduce sample complexity by transforming classical training data into special sets of 
quantum states. We call this line of research Quantum Computational Learning [19-37]. On the other hand, certain 
fundamental quantum problems, such as the inference of unknown quantum states or operations, or the hidden 
structure of the underlying quantum system, fit well into the setting of statistical learning theory. However, this 
requires a certain generalisation of the current theory of machine learning to accommodate operator-valued inputs 
and/or outputs. We term this line of research Quantum Statistical Learning 1 [33, 36, 38-54]. 

Current achievements in Quantum Computational Learning come from quantum enhancement of the computation 
procedures such as optimization, inner product of big data and ability to compute classical functions in parallel. 
For example, Servedio and Gortler [19-22] considered two standard learning models of Boolean functions: Angluin’s 
[55] exact learning from membership queries, and Valiant’s [4] PAC learning from examples. By defining quantum 
extensions of the classical oracles to manipulate classical binary data, it was shown that the quantum oracles and 
classical machines are polynomially equivalent in terms of sample complexity. Anguita et al. (2003) [23] used the 
method of Dürr and Høyer [56] to perform the optimization process in support vector machine (SVM). Aïmeur, 
Brassard, and Gambs (2006) [24, 26] applied a modified Grover’s algorithm [8] in clustering problems. Lloyd et al. 
(2013) [28-30] introduced a quantum random access memory [57] to store classical data and proposed an efficient 
density matrix exponentiation method to improve the computational procedure of supervised, unsupervised and 
SVM algorithms. Additionally, Lloyd et al. [34] also provided quantum algorithms to execute topological analysis 
for big data. Pudenz and Lidar (2012) [25] considered the verification of software and applied adiabatic quantum 
computation methods to solve the quadratic binary optimization problem. Wiebe, Kapoor, and Svore (2014) [31] 
(Microsoft Research) modified Lloyd’s approach and proposed a quantum nearest-neighbor algorithm. Surprisingly, 
they showed that the number of queries depends on the sparsity and maximum value of the training data rather 
than on the feature dimension. Wang (2014) [32] combined phase estimation and the dense Hamiltonian simulation 
technique to improve the ML performance in curve fitting. Cross et al. [35] considered the problem of learning 
parity functions in the presence of noise. They showed that the quantum oracle is computationally more efficient than the 
classical counterpart. Schuld et al. [36] presented a quantum pattern classification algorithm and discussed its advantages. 
Recently, Wiebe et al. [37] successfully applied quantum computers to perform an important machine learning 
task: deep learning. We refer the interested readers to Ref. [17, Table 1.1], where Wittek provides a detailed 
comparison of existing quantum machine learning algorithms. 

On Quantum Statistical Learning, Aïmeur, Brassard, and Gambs [38] introduced the task of quantum clustering, 
where the goal is to group similar quantum states (according to some fidelity measure) while putting dissimilar states 
in different clusters. In Ref. [40], Gambs (2008) studied the task of quantum classification, in which the training 
data set contains pure states from (classical) binary classes. By forming the statistical mixture states of each class, 
the Helstrom measurement forms a binary classifier which minimizes the training error. Guţă and Kotłowski (2010) 
[41] researched the problem of classifying two unknown mixed states and used the technique of local asymptotic 
normality to derive the optimal classifier. Sentís et al. [46, 47] also proposed a strategy to perform quantum 
state classification. Nevertheless, the approaches developed by Gambs, Guţă and Kotłowski, and Sentís et al. are 
essentially quantum hypothesis testing rather than quantum ML (since they do not consider the model complexity 
or sample complexity problems). In [42], Bisio et al. (2010) considered learning a unitary transformation as the 
storing-retrieving problem and proposed an algorithm for learning U based on the quantum network. Further, 
Bisio et al. (2011) [43] generalised the previous work to quantum instruments. Gross and Flammia et al. [44, 45] 
integrated the compressed sensing methods to quantum states tomography and proposed an algorithm to practically 
learn low-rank quantum states. In the latest work, Lu and Braunstein (2014) [48] studied the quantum version 
of the decision tree (QDT). In their model, both the input variable x and output label y are represented as pure 
states $|x\rangle$ and $|y\rangle$. The von Neumann entropy is used as the node splitting criterion to construct the quantum 
decision tree classifier. Several works [33, 36, 49, 50] have engaged in developing possible models of quantum neural 
network (QNN). Another interesting topic is the hidden quantum Markov model (HQMM) [51-54], where the state 
of the system is described by the density operator and the transitions are determined by the completely positive 
trace-nonincreasing map. It has been shown that the HQMM can be implemented by open quantum systems with 
instantaneous feedback. 

Figure 1. Current Development of Quantum Machine Learning. ‘Quantum Computational Learning’ 
investigates how quantum machines can serve to accelerate the ML process to improve computational 
efficiency, or to reduce sample complexity by transforming classical training data into 
special sets of quantum states. In this line of research, both the input space $\mathcal{X}$ and output space $\mathcal{Y}$ 
are classical. On the other hand, ‘Quantum Statistical Learning’ studies the inference of unknown 
quantum states, operations, or hidden structure in the quantum system. We term the quantum 
version of a classical statistical/stochastic model a ‘Quantum Stochastic Model’. 

We summarise the current development of supervised quantum ML in Figure 1. Note that the majority of 
previous works in quantum machine learning focused on computational aspects of a learning algorithm. The issue 
of sample complexity in the original quantum learning settings, e.g. state/process tomography, has rarely been 
addressed. Aaronson [39] pioneered the study of the learnability problems in the quantum regime, and derived upper 
bounds on the sample complexity of learning quantum states. In this work, we start from a machine learning point 
of view to formalize the problems of learning quantum measurements and quantum states as learning real-valued 
functions on a Banach space. For learning an unknown quantum measurement, we apply a sequence of quantum 
states through the measurement apparatus and obtain the statistics of each measurement outcome. Our goal is to 
infer the most likely quantum measurement from the hypothesis set, which ‘behaves’ like the target measurement 
on the collected data. In this paper, we mainly focus on learning an unknown two-outcome measurement, which 
resembles a ‘yes-no’ instrument. For multi-outcome measurements, the results can easily be generalised (see footnote 2). For 
learning quantum states, on the other hand, the training data set is the collection of two-outcome measurements 
and the associated statistics. The core problem now is to analyse whether the target quantum measurement is 
learnable and to characterise the performance of the learning tasks. 

1.1. Contributions of this work. In this work, we answer the following two questions in quantum ML. 


2 In the scenario of learning multi-outcome measurements, each POVM element can be considered as a two-outcome POVM. Hence, the 
learnability of each POVM element can be derived by following the proposed paradigm. We note that this problem can be tackled by 
the multi-label learning algorithms (also called multi-target prediction or multivariate regression). 
How many quantum states are sufficient to learn a quantum measurement? Assume there is an 
unknown two-outcome quantum measurement device, and we can prepare a set of quantum states that are randomly 
drawn from an unknown distribution. Suppose that the outcome statistics of the set of quantum states are known. 
Can we infer the unknown quantum measurement from the quantum states at hand? How many samples of quantum 
states are needed for the learning machine to decide an optimal quantum measurement from the hypothesis set? Can 
the chosen candidate approximate the target measurement with the desired accuracy? These questions are typical 
sample complexity problems in statistical learning theory, and the answer lies in a proper quantification of the 
“effective size” of the hypothesis set. In this paper, we propose a framework (see Section 3) to connect the problems 
of learning two-outcome measurements with the tasks of learning real-valued linear functionals on quantum states. 
By exploiting Banach space theory and the noncommutative Khintchine inequalities [58] in Random Matrix Theory, 
we prove (Theorem 4.1) that the fat-shattering dimension, our complexity measure, is upper bounded by $O(d/\epsilon^2)$. 
Under the same framework, other complexity measures, such as covering numbers and Rademacher complexity, 
can be derived. As a result, the number of required sample states to learn an unknown quantum measurement is 
proportional to the dimension of the Hilbert space. 
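
To make the reduction concrete, the following minimal sketch (our illustration, not part of the proofs) treats a two-outcome measurement, represented by an effect E, as the real-valued function that maps a state to the probability Tr(E rho) of the 'yes' outcome. The helper names and the qubit example are assumptions chosen for illustration; the sketch only checks that the function is linear under mixing of states.

```python
# Sketch: an effect E acts on states as the linear functional rho -> Tr(E rho).
# All names and numerical values are illustrative assumptions.

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(m):
    """Sum of the diagonal entries."""
    return sum(m[i][i] for i in range(len(m)))

def outcome_probability(effect, state):
    """Probability of the 'yes' outcome, Tr(E rho)."""
    return trace(matmul(effect, state)).real

# Qubit example: a diagonal effect and two states.
effect = [[0.9, 0.0], [0.0, 0.2]]
rho_plus = [[0.5, 0.5], [0.5, 0.5]]   # |+><+|
rho_zero = [[1.0, 0.0], [0.0, 0.0]]   # |0><0|

p_plus = outcome_probability(effect, rho_plus)
p_zero = outcome_probability(effect, rho_zero)

# Mixing the states with weights 0.3 and 0.7 mixes the probabilities.
mixture = [[0.3 * rho_plus[i][j] + 0.7 * rho_zero[i][j] for j in range(2)]
           for i in range(2)]
p_mixture = outcome_probability(effect, mixture)
```

In the learning setting, the training data would pair sampled states with such probabilities, and the hypothesis set would consist of the functions induced by all candidate effects.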

How many quantum measurements are sufficient to learn a quantum state? Following the paradigm 
of learning quantum measurements, we can similarly formalize the problem of learning an unknown state into its 
dual problem. Unlike Aaronson [39], we employ tools solely from statistical learning theory to show (Theorem 5.1) 
that the fat-shattering dimension is $O(\log d/\epsilon^2)$ for learning a qudit state. In addition, we also derive the covering 
number and the Rademacher complexity. Our results show that all three complexity measures scale 
logarithmically with the dimension of the Hilbert space. 

Lastly, by formulating the quantum learning problems in the Bloch-sphere representation, we show that they are equivalent 
to a neural network. Hence classical ML algorithms can be applied in practice to perform quantum ML tasks. 
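
As a hedged illustration of this idea in the simplest case, consider a single qubit with state rho = (I + r.sigma)/2 and effect E = e0 I + e.sigma. Then Tr(E rho) = e0 + e.r, which is a single linear neuron with weights e, bias e0 and input r. The sketch below checks this identity numerically; the qubit restriction, the helper names and the numerical values are our assumptions, and the general construction is the subject of Section 6.

```python
# Sketch (qubit case only): Tr(E rho) equals an affine function of the
# Bloch vector r. Names and values are illustrative assumptions.

PAULI_X = [[0, 1], [1, 0]]
PAULI_Y = [[0, -1j], [1j, 0]]
PAULI_Z = [[1, 0], [0, -1]]
IDENTITY = [[1, 0], [0, 1]]

def combine(c0, vec):
    """Return the 2x2 matrix c0*I + vec[0]*X + vec[1]*Y + vec[2]*Z."""
    paulis = (PAULI_X, PAULI_Y, PAULI_Z)
    return [[c0 * IDENTITY[i][j] + sum(v * s[i][j] for v, s in zip(vec, paulis))
             for j in range(2)] for i in range(2)]

def state_from_bloch(r):
    """Density matrix (I + r.sigma)/2 for a Bloch vector r with |r| <= 1."""
    return [[entry / 2 for entry in row] for row in combine(1.0, r)]

def effect_from_bloch(e0, e):
    """Effect e0*I + e.sigma; a valid effect needs 0 <= e0 -/+ |e| <= 1."""
    return combine(e0, e)

def trace_product(a, b):
    """Tr(a b) for 2x2 matrices."""
    return sum(a[i][k] * b[k][i] for i in range(2) for k in range(2)).real

def linear_neuron(bias, weights, inputs):
    """Affine map bias + <weights, inputs>."""
    return bias + sum(w * x for w, x in zip(weights, inputs))

r = (0.3, -0.2, 0.5)
e0, e = 0.5, (0.1, 0.2, -0.3)

p_trace = trace_product(effect_from_bloch(e0, e), state_from_bloch(r))
p_neuron = linear_neuron(e0, e, r)
```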

There are several fields that may relate to or benefit from our work. 

Quantum State/Measurement Tomography. Quantum state tomography is a difficult task in physics because 
the number of unknown parameters in a multi-partite quantum system grows exponentially. Aaronson pointed out 
that quantum ML can serve as an alternative approach to quantum state tomography [39]. Surprisingly, learning an 
unknown target state within a given accuracy requires only a number of measurements that grows logarithmically 
with the dimension d. In this work, we push Aaronson’s result one step further and apply the machine 
learning framework to quantum measurement tomography. To the best of our knowledge, there are very 
few results in this direction. We hope that our result in learning quantum measurements will stimulate further 
investigation into this problem. 

Quantum State Discrimination. The goal of quantum state discrimination is to determine the identity 
of a state in an ensemble. Whenever states are not mutually orthogonal, they cannot be perfectly discriminated. 
Therefore, one possible approach is ambiguous state discrimination with the goal of minimizing the error probability. Given 
$\epsilon > 0$, we show that the fat-shattering dimension guarantees that a set of quantum states can be discriminated into 
two subsets with the worst error probability no greater than $1/2 - \epsilon$. Following the same reasoning, the quantum 
states in the hypothesis set can be used to distinguish between two-outcome measurements. 

Quantum Random Access Codes. An (n,m,p)-QRA code encodes an n-bit sequence into 
m qubits so that the receiver can recover any one of the bits with success probability at least p. The information-theoretic 
inequalities of n and m provide an upper bound for the fat-shattering dimension of learning quantum states. 
Alternatively, we can use the pseudo dimension as the complexity measure to show that there exists no (n,m,p)-QRA 
coding scheme with $n > 2^{2m}$. The result coincides with the work of Hayashi et al. [59]. See Section 5.4 for further 
discussions. 

The paper is organised as follows. In Section 2 we introduce the background of statistical learning theory 
(especially on supervised learning) and describe important complexity measures. In Section 3, we formalise a 
unified framework to relate the problems of learning quantum measurements and learning quantum states with the 
learning of real-valued functions. Based on the proposed approach, we study the learning of quantum measurements and 
prove the main results in Section 4. In addition, we interpret the results in terms of ambiguous set discrimination 
and also derive the covering numbers and the Rademacher complexity. In Section 5, we consider the problem of 
learning quantum states and discuss its relationship with QRA codes. In Section 6, we formulate the learning 
problem into Bloch-sphere representation and discuss possible algorithms (e.g. neural networks) to implement the 
quantum learning tasks. We conclude this paper in Section 7. 
Notation. In this paper, we denote a Hilbert space by $\mathcal{H}$. The trace of an operator $M$ on $\mathcal{H}$ is calculated as 

\[ \mathrm{Tr}(M) := \sum_k e_k^\dagger M e_k, \]

where $\{e_k\}$ is any orthonormal basis of $\mathcal{H}$. Let $\mathbb{M}_d$ denote the set of all self-adjoint operators on $\mathbb{C}^d$. The 
Hilbert-Schmidt inner product on $\mathbb{M}_d$ can be defined as $\langle A, B\rangle_{\mathrm{HS}} := \mathrm{Tr}(AB)$, where the subscript ‘HS’ will be omitted 
when the context is clear. For $p \in [1,\infty)$, we denote the Schatten p-norm of an operator $M$ as 

\[ \|M\|_p := \Big( \sum_{i \ge 1} |\lambda_i(M)|^p \Big)^{1/p}, \]

where $\lambda_i(M)$ are the eigenvalues of $M$. We denote $\|M\|_\infty := \sup_i |\lambda_i(M)|$ as the operator norm. Clearly, $\|\cdot\|_1$ and 
$\|\cdot\|_2$ correspond to the trace norm and the Hilbert-Schmidt norm $\|\cdot\|_{\mathrm{HS}}$ respectively. Slightly abusing the notation, 
we also denote the conventional $\ell_p$ norm on $\mathbb{R}^d$ by $\|\cdot\|_p$ for $p \in [1,\infty]$. We define the unit ball associated with the 
Schatten norms as $S_p^d = \{M \in \mathbb{M}_d : \|M\|_p \le 1\}$. The set of bounded operators on $\mathcal{H}$ is denoted as $\mathcal{B}(\mathcal{H})$, which is 
the set of operators with finite Schatten $\infty$-norm. Likewise, the set of operators with finite Schatten 1-norm is called 
the set of trace class operators, $\mathcal{T}(\mathcal{H})$. 
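
The Schatten norms above can be evaluated directly from the eigenvalues. The sketch below is a minimal illustration restricted to 2x2 self-adjoint matrices, for which the eigenvalues have a closed form; the function names and the test matrix are our own assumptions.

```python
import math

# Sketch: Schatten p-norms of a 2x2 self-adjoint matrix from its eigenvalues.
# Function names and the example matrix are illustrative assumptions.

def hermitian_2x2_eigenvalues(m):
    """Eigenvalues of [[a, b], [conj(b), d]] with real a and d."""
    a, d = m[0][0].real, m[1][1].real
    b = m[0][1]
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + abs(b) ** 2)
    return [mean + radius, mean - radius]

def schatten_norm(eigenvalues, p):
    """(sum_i |lambda_i|^p)^(1/p), or max_i |lambda_i| for p = infinity."""
    magnitudes = [abs(v) for v in eigenvalues]
    if p == math.inf:
        return max(magnitudes)
    return sum(v ** p for v in magnitudes) ** (1.0 / p)

matrix = [[1, 1 - 1j], [1 + 1j, -1]]
eigs = hermitian_2x2_eigenvalues(matrix)

trace_norm = schatten_norm(eigs, 1)             # Schatten 1-norm
hilbert_schmidt = schatten_norm(eigs, 2)        # Schatten 2-norm
operator_norm = schatten_norm(eigs, math.inf)   # Schatten infinity-norm
```

For this matrix the eigenvalues are plus and minus the square root of 3, so the three norms are ordered as operator norm, Hilbert-Schmidt norm, trace norm, in line with the general ordering of Schatten norms.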

A quantum state (also called a density operator) on the Hilbert space $\mathcal{H}$ is a positive semi-definite operator with 
unit trace. We identify the state space as the set of all quantum states on $\mathcal{H}$, i.e., 

\[ \mathcal{Q}(\mathcal{H}) := \{\rho \in \mathcal{T}(\mathcal{H}) : \rho \succeq 0,\ \mathrm{Tr}(\rho) = 1\}. \]

A positive operator-valued measure (POVM) on $\mathcal{H}$ is a finite set of positive semi-definite operators $\{\Pi_i\}_{i \in \mathcal{I}}$ such 
that 

\[ \sum_{i \in \mathcal{I}} \Pi_i = I, \]

where $I$ denotes the identity operator on $\mathcal{H}$. Each POVM element is called a quantum effect, which serves as 
an instrument to perform a yes-no measurement. We denote the set of all effects as an effect space: 

\[ \mathcal{E}(\mathcal{H}) := \{E \in \mathcal{B}(\mathcal{H}) : 0 \preceq E \preceq I\}. \]
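
As a small sanity check of these definitions, the sketch below tests whether a 2x2 self-adjoint matrix is an effect, i.e. whether its eigenvalues lie in [0, 1], and builds the complementary effect I - E so that the pair forms a two-outcome POVM. The qubit restriction, the tolerance and the function names are illustrative assumptions.

```python
import math

# Sketch: membership test for the qubit effect space and the complementary
# effect of a two-outcome POVM. Names and tolerance are assumptions.

def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 self-adjoint matrix [[a, b], [conj(b), d]]."""
    a, d = m[0][0].real, m[1][1].real
    b = m[0][1]
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + abs(b) ** 2)
    return [mean + radius, mean - radius]

def is_effect(m, tol=1e-12):
    """True if 0 <= E <= I in the operator order (eigenvalues in [0, 1])."""
    return all(-tol <= v <= 1 + tol for v in eigenvalues_2x2(m))

def complement(m):
    """The 'no' effect I - E paired with a 'yes' effect E."""
    return [[(1 if i == j else 0) - m[i][j] for j in range(2)]
            for i in range(2)]

yes_effect = [[0.75, 0.25], [0.25, 0.25]]
no_effect = complement(yes_effect)
too_large = [[1.2, 0.0], [0.0, 0.1]]   # largest eigenvalue exceeds 1

yes_ok = is_effect(yes_effect)
no_ok = is_effect(no_effect)
too_large_ok = is_effect(too_large)
```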

All constants are denoted as $C$ or $c$ and are independent of other parameters. Their values may change from 
line to line. The notation $A \lesssim B$ means there is a constant $c$ such that $A \le cB$, and $A \simeq B$ means both $A \lesssim B$ and 
$A \gtrsim B$. We summarise all the notation in Table 2 in Appendix A. 

2. Background of Statistical Learning Theory 

The starting point of this section is the mathematical formalism of the supervised machine learning. We describe 
the effectiveness of a learning machine and examine the number of samples required to produce an almost optimal 
function with an error rate below the desired accuracy. As will be shown later, the bound of the sample complexity 
is closely related to the measures of complexity which characterise the “size” of a function class. 

2.1. Supervised Machine Learning. Generally speaking, supervised learning is a ML task that infers a function 
(or a learning model) by observing the data and the response to the data. In this work, we focus on the definitions 
of agnostic PAC learnability and sample complexity for supervised machine learning. For a more comprehensive 
introduction to ML, we refer the readers to literature such as Refs. [2, 60-65]. 

Consider a probability space $(\mathcal{Z}, \mu)$, where $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$ with $\mathcal{X}$ (called the input space) a measurable space and 
$\mathcal{Y}$ (called the output space) a closed subset of the real line $\mathbb{R}$. The probability distribution $\mu$ over $\mathcal{Z}$ is assumed to be 
fixed but known only through the training data set, i.e. $Z_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\} \in \mathcal{Z}^n$ sampled independently 
and identically according to the measure $\mu$. Supervised learning aims to construct a function $f : \mathcal{X} \to \mathcal{Y}$ which 
approximates the functional relationship between the input variable $X \in \mathcal{X}$ and the output variable $Y \in \mathcal{Y}$ from 
the observed training data set. To evaluate the performance of the approximation, we define the loss function as a 
measurable map $\ell_f : \mathcal{Z} \to [0,+\infty)$ and the expected risk (also called the out-of-sample error): 

\[ L(f) = \mathbb{E}_\mu\, \ell_f(X, Y). \]

The loss function is usually taken as the absolute error or square error, i.e. 

\[ \ell_f(X, Y) = |f(X) - Y| \quad \text{or} \quad \ell_f(X, Y) = (f(X) - Y)^2. \]
For convenience, we only consider the square error in this work. The results generalise easily to other loss functions 
that satisfy the Lipschitz condition (see footnote 3). 

Since we are interested in minimising the expected risk, we define the target function (or Bayes function) as 
$t(x) = \mathbb{E}[Y | X = x]$, which achieves the minimum expected risk (called the Bayes risk), i.e. 

\[ L_{\mathrm{Bayes}} := L(t) = \inf_f L(f), \tag{2.1} \]

where the infimum is taken over all possible measurable functions from $\mathcal{X}$ to $\mathcal{Y}$. When $Y$ is a deterministic function 
of $X$, then $Y = t(X)$ almost surely and $L(t) = 0$. 
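
For completeness, the standard argument that the conditional mean attains the infimum in Eq. (2.1) under the square error can be sketched as follows. We assume here that $Y$ has a finite second moment, which the text does not state explicitly.

```latex
\begin{align*}
L(f) &= \mathbb{E}\big[(f(X) - Y)^2\big] \\
     &= \mathbb{E}\big[(f(X) - t(X))^2\big]
        + 2\,\mathbb{E}\big[(f(X) - t(X))\,(t(X) - Y)\big]
        + \mathbb{E}\big[(t(X) - Y)^2\big] \\
     &= \mathbb{E}\big[(f(X) - t(X))^2\big] + L(t).
\end{align*}
```

The cross term vanishes because, conditionally on $X$, the factor $f(X) - t(X)$ is fixed and $\mathbb{E}[t(X) - Y \mid X] = 0$. Hence $L(f) \ge L(t)$ for every measurable $f$, with equality when $f = t$ almost surely.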

The goal of the learner is to identify the target function $t$ from a collection of functions $\mathcal{F}$, called the hypothesis 
set (see footnote 4), which is a set of real-valued functions defined on the input space $\mathcal{X}$. A learning algorithm $\mathcal{A}$ for the hypothesis 
set $\mathcal{F}$ is a mapping that assigns to every training data set $Z_n$ some candidate function $\mathcal{A}(Z_n) \in \mathcal{F}$, i.e. 

\[ \mathcal{A} : \bigcup_{n=1}^{\infty} \mathcal{Z}^n \to \mathcal{F}. \]

The effectiveness of the learning algorithm is measured by the number of data points required to produce an almost optimal 
function in the sense of Eq. (2.1). Therefore, we introduce one of the most fundamental concepts in supervised 
machine learning, the agnostic Probably Approximately Correct (PAC) learning model [4, 68]: 

Definition 2.1 (Agnostic PAC Learnability [65, Def. 3.3]). A hypothesis set $\mathcal{F}$ is agnostic PAC learnable if there 
exist a function $m_{\mathcal{F}} : (0,1)^2 \to \mathbb{N}$ and a learning algorithm with the following property: for every $\epsilon, \delta \in (0,1)$ and 
for every distribution $\mu$ over $\mathcal{Z}$, when running the learning algorithm on $n \ge m_{\mathcal{F}}(\epsilon, \delta)$ samples generated by $\mu$, the 
algorithm returns a hypothesis $f$ such that, with probability at least $1 - \delta$ (over the choice of the $n$ training 
examples), 

\[ L(f) \le \inf_{f' \in \mathcal{F}} L(f') + \epsilon. \]

However, the expected risk $L(f) = \mathbb{E}_\mu[\ell_f(X, Y)]$ cannot be calculated since $\mu$ is unknown. We can only evaluate 
the agreement of a candidate function over the training data set, which is called the empirical risk (also called the 
in-sample error): 

\[ L_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell_f(X_i, Y_i). \]

For example, one of the most well-known learning algorithms is the Empirical Risk Minimization (ERM) principle 
[2] that assigns a function $f_n \in \mathcal{F}$ to each training data set which is “almost optimal” on the data, i.e. 

\[ f_n = \arg\min_{f \in \mathcal{F}} L_n(f). \tag{2.2} \]
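
The ERM rule of Eq. (2.2) can be sketched for a finite hypothesis set as follows. The hypothesis set of linear functions indexed by their slope, the noiseless training sample and all function names are illustrative assumptions; the loss is the square error used throughout this work.

```python
# Sketch: empirical risk minimisation over a finite hypothesis set with the
# square error. The hypothesis set and the sample are illustrative assumptions.

def empirical_risk(f, sample):
    """L_n(f): average square error of f over the training sample."""
    return sum((f(x) - y) ** 2 for x, y in sample) / len(sample)

def make_linear(slope):
    """Hypothesis x -> slope * x."""
    return lambda x: slope * x

def erm(hypotheses, sample):
    """Return the key of the hypothesis with the smallest empirical risk."""
    return min(hypotheses, key=lambda s: empirical_risk(hypotheses[s], sample))

# Finite hypothesis set indexed by slope: 0.0, 0.2, ..., 1.0.
hypotheses = {i / 5: make_linear(i / 5) for i in range(6)}

# Noiseless training data generated by the target t(x) = 0.6 x.
sample = [(x, 0.6 * x) for x in (0.1, 0.5, 0.9)]

best_slope = erm(hypotheses, sample)
best_risk = empirical_risk(hypotheses[best_slope], sample)
```

Because the target belongs to the hypothesis set and the data are noiseless, ERM recovers it with zero empirical risk; with noisy data the minimiser would only be close to the target.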

One way to evaluate the performance of the learning algorithm is to relate the risk $L(f_n)$ to the empirical risk 
$L_n(f_n)$. Following the reasoning of the agnostic PAC model, our goal is hence to estimate the generalisation error $\epsilon$: 

\[ L(f_n) \le L_n(f_n) + \epsilon(n, \mathcal{F}). \]

For any algorithm that outputs an $f_n \in \mathcal{F}$, we have 

\[ L(f_n) - L_n(f_n) \le \sup_{f \in \mathcal{F}} \{ L(f) - L_n(f) \}, \]

which leads to the definition of the uniform Glivenko-Cantelli class (uGC class). 

Definition 2.2. We say that the hypothesis set $\mathcal{F}$ is a uniform Glivenko-Cantelli class if for every $\epsilon > 0$, 

\[ \lim_{n \to \infty} \sup_{\mu} \Pr\Big\{ \sup_{f \in \mathcal{F}} \big| L(f) - L_n(f) \big| \ge \epsilon \Big\} = 0. \]


3 A loss function $\ell_f : \mathcal{Z} \to [0,\infty)$ is a Lipschitz function if it satisfies the Lipschitz condition 

\[ |\ell_f(X, Y) - \ell_g(X, Y)| \le L\,|f(X) - g(X)| \]

for all possible $(X, Y) \in \mathcal{Z}$ and the quantity $L \in \mathbb{R}$ is called the Lipschitz constant. Denote by $\ell_{\mathcal{F}}$ the set $\{\ell_f : f \in \mathcal{F}\}$. Then the 
complexity measures (e.g. the covering number and Rademacher complexity) of the class $\ell_{\mathcal{F}}$ differ from those of the hypothesis 
set $\mathcal{F}$ by the Lipschitz constant $L$ [66, 67], i.e. 

\[ \mathcal{N}_p(\epsilon, \ell_{\mathcal{F}}, m) \le \mathcal{N}_p(\epsilon/L, \mathcal{F}, m) \quad \text{for } p \ge 1,\ m \in \mathbb{N} \]

and 

\[ \mathcal{R}_n(\ell_{\mathcal{F}}) \le L\, \mathcal{R}_n(\mathcal{F}). \]

Therefore, by homogeneity we may assume the loss function is the absolute error with $L = 1$ or the square error with $L = 2$ when deriving the 
sample complexity. 

4 Note that we use the terms ‘hypothesis set’ and ‘function class’ interchangeably throughout the paper. 
The uniformity is with respect to all members of $\mathcal{F}$ and over all possible probability measures $\mu$ on the domain 
$\mathcal{Z}$. In addition to the conditions for learnability, we also consider the bound on the rate of uniform convergence. 
For every $0 < \epsilon, \delta < 1$, let $m_{\mathcal{F}}(\epsilon, \delta)$ be the first integer such that for every $n \ge m_{\mathcal{F}}(\epsilon, \delta)$ and any probability measure $\mu$, 

\[ \Pr\Big\{ \sup_{f \in \mathcal{F}} \big| L(f) - L_n(f) \big| \ge \epsilon \Big\} \le \delta. \tag{2.3} \]

The quantity $m_{\mathcal{F}}(\epsilon, \delta)$ satisfying Eq. (2.3) is called the (Glivenko-Cantelli) sample complexity of the hypothesis set 
$\mathcal{F}$ with accuracy $\epsilon$ and confidence $\delta$. The sample complexity encapsulates the number of samples required to learn 
a set of functions. 
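
To give a feel for how the sample complexity depends on accuracy and confidence, the sketch below evaluates the textbook bound for a finite hypothesis set with a loss bounded in [0, 1]. It follows from Hoeffding's inequality and a union bound, which give Pr{sup_f |L(f) - L_n(f)| >= eps} <= 2|F| exp(-2 n eps^2). This is a standard result stated under those assumptions, not one of the bounds derived in this paper, and the function names are ours.

```python
import math

# Sketch: sufficient sample size for uniform convergence over a FINITE
# hypothesis set with loss values in [0, 1] (Hoeffding plus union bound).
# This is a textbook bound used for illustration only.

def finite_class_sample_complexity(num_hypotheses, epsilon, delta):
    """Smallest n with 2 * |F| * exp(-2 * n * epsilon**2) <= delta."""
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * epsilon ** 2))

def failure_probability_bound(num_hypotheses, epsilon, n):
    """Upper bound on Pr{ sup_f |L(f) - L_n(f)| >= epsilon }."""
    return 2 * num_hypotheses * math.exp(-2 * n * epsilon ** 2)

n_required = finite_class_sample_complexity(100, epsilon=0.1, delta=0.05)
bound_at_n = failure_probability_bound(100, 0.1, n_required)
bound_before = failure_probability_bound(100, 0.1, n_required - 1)
```

The logarithmic dependence on the number of hypotheses is what the complexity measures of the next subsection generalise to infinite function classes.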

Vapnik studied the relation between the uGC class and learnability [1, 2, 69] and showed that if a hypothesis set 
$\mathcal{F}$ is a uGC class, then it is agnostic PAC learnable (see footnote 5). 


Theorem 2.1 (Uniform Convergence [65, Corollary 4.4]). A training data set $Z_n$ is called $\epsilon$-representative (with 
respect to domain $\mathcal{Z}$, hypothesis set $\mathcal{F}$, loss function $\ell$, and distribution $\mu$) if 

\[ \forall f \in \mathcal{F}, \quad \big| L_n(f) - L(f) \big| \le \epsilon. \]

Then, for every $\epsilon, \delta \in (0,1)$ and every probability distribution $\mu$ over $\mathcal{Z}$, a uGC class $\mathcal{F}$ that guarantees an 
$\epsilon/2$-representative set with probability at least $1 - \delta$ is agnostic PAC learnable. Furthermore, the ERM algorithm is 
an agnostic PAC learner for $\mathcal{F}$. 


As a result, we consider the generalisation error $\epsilon(n, \mathcal{F})$ and the sample complexity $m_{\mathcal{F}}(\epsilon, \delta)$ of the hypothesis set 
$\mathcal{F}$ as the performance criteria to investigate whether the underlying learning problem is agnostic PAC learnable. 

In summary, the fundamental problems in ML are two-fold. The first is under what conditions the machine is 
agnostic PAC learnable. Secondly, the sample complexity determines the rate of the uniform convergence and the 
information-theoretic efficiency of the hypothesis set F. In the next subsection, several complexity measures are 
introduced to characterise the “richness” or “effective size” of the hypothesis set. In Appendix B, we show that the 
sample complexity can be further expressed in terms of the complexity measures. 


2.2. Measures of Sample Complexity. As discussed before, we are interested in the parameters which effectively 
measure the size of a given hypothesis set. There are some well-known measures of (information) complexity (see footnote 6) of 
the function class: combinatorial parameters, covering numbers, and Rademacher complexity. 

The first combinatorial parameter, the Vapnik-Chervonenkis (VC) dimension, was introduced by Vapnik and Chervonenkis 
[72] for learning Boolean functions. 

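Definition 2.3 (VC Dimension). Let $\mathcal{F}$ be a set of $\{0,1\}$-valued functions on a domain $\mathcal{X}$. We say that $\mathcal{F}$ shatters 
a set $\{x_1, \ldots, x_n\} \subset \mathcal{X}$ if for every subset $B \subset \{1, \ldots, n\}$ there exists a function $f_B \in \mathcal{F}$ for which $f_B(x_i) = 1$ if 
$i \in B$, and $f_B(x_i) = 0$ if $i \notin B$. Let 

\[ \mathrm{VCdim}(\mathcal{F}) = \sup\{ |S| : S \subset \mathcal{X},\ S \text{ is shattered by } \mathcal{F} \}. \]

The VC dimension of $\mathcal{F}$ (on the domain $\mathcal{X}$) is denoted as $\mathrm{VCdim}(\mathcal{F})$. 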
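
For small finite examples, the definition can be checked by brute force. The sketch below enumerates the labelling patterns a finite set of {0,1}-valued functions induces on candidate subsets of a finite domain. The two function classes (thresholds and intervals on four points) and all names are illustrative assumptions.

```python
from itertools import combinations

# Sketch: brute-force VC dimension of a finite class of {0,1}-valued
# functions on a finite domain. The example classes are assumptions.

def is_shattered(points, functions):
    """True if the functions realise all 2^n labellings of the points."""
    patterns = {tuple(f(x) for x in points) for f in functions}
    return len(patterns) == 2 ** len(points)

def vc_dimension(domain, functions):
    """Largest size of a shattered subset of the domain."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(is_shattered(s, functions) for s in combinations(domain, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best

def threshold(t):
    """Indicator of the half-line [t, infinity)."""
    return lambda x: 1 if x >= t else 0

def interval(a, b):
    """Indicator of the closed interval [a, b]."""
    return lambda x: 1 if a <= x <= b else 0

domain = [1, 2, 3, 4]
cuts = [0.5, 1.5, 2.5, 3.5, 4.5]
thresholds = [threshold(t) for t in cuts]
intervals = [interval(a, b) for a in cuts for b in cuts if a < b]

vc_thresholds = vc_dimension(domain, thresholds)
vc_intervals = vc_dimension(domain, intervals)
```

Thresholds cannot label a smaller point 1 and a larger point 0, and intervals cannot realise the pattern (1, 0, 1) on three ordered points, which is why the two classes have VC dimension 1 and 2 respectively.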

Pollard [73] generalised the concept of VC dimension and introduced the pseudo dimension to quantify the 
sample complexity of a real-valued function class. The parameterised version of Pollard’s pseudo-dimension is the 
scale-sensitive dimension (also called the fat-shattering dimension ) introduced by Kearns and Schapire [74]. 

Definition 2.4 (Pseudo Dimension). Let $\mathcal{F}$ be a set of real-valued functions on a domain $\mathcal{X}$. We say a set 
$S = \{x_1, \ldots, x_n\} \subset \mathcal{X}$ is pseudo-shattered by $\mathcal{F}$ if there exists a set $\{\alpha_i\}_{i=1}^n$ such that for every $B \subset \{1, \ldots, n\}$ 
there is some function $f_B \in \mathcal{F}$ for which $f_B(x_i) \ge \alpha_i$ if $i \in B$, and $f_B(x_i) < \alpha_i$ if $i \notin B$. Define the pseudo 
dimension of $\mathcal{F}$ as 

\[ \mathrm{Pdim}(\mathcal{F}) = \sup\{ |S| : S \subset \mathcal{X},\ S \text{ is pseudo-shattered by } \mathcal{F} \}. \]

$f_B$ is called the shattering function of the set $S$. 

There is a desirable property of the pseudo dimension that will be useful in our main theorems. 

Theorem 2.2 (Pollard [73]). 

(i) If $\mathcal{F}$ is a vector space of real-valued functions then $\mathrm{Pdim}(\mathcal{F}) = \dim(\mathcal{F})$. 

(ii) If $\mathcal{F}$ is a subset of a vector space $\mathcal{F}'$ of real-valued functions then $\mathrm{Pdim}(\mathcal{F}) \le \dim(\mathcal{F}')$. 

5 Agnostic PAC learnability is also called learnability with ERM; equivalently, we say that the ERM algorithm is consistent. Recent works consider 
the stability of the learning algorithm as one of the criteria of learnability. However, in this paper we do not deal with issues of 
stability and refer interested readers to Refs. [70, 71] and the references therein. 

6 The complexity measures introduced in this section and the generalisation bounds derived in Appendix B are information-theoretic in 
the sense that the learning algorithms are based on the agnostic PAC model regardless of the computational resources. 

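Definition 2.5 (Fat-Shattering Dimension). Let $\mathcal{F}$ be a set of real-valued functions on a domain $\mathcal{X}$. For every 
$\epsilon > 0$, a set $S = \{x_1, \ldots, x_n\} \subset \mathcal{X}$ is said to be $\epsilon$-shattered by $\mathcal{F}$ if there exists a set $\{\alpha_i\}_{i=1}^n \subset \mathbb{R}$ such that 
for every $B \subset \{1, \ldots, n\}$ there is some function $f_B \in \mathcal{F}$ for which $f_B(x_i) \ge \alpha_i + \epsilon$ if $i \in B$, and $f_B(x_i) \le \alpha_i - \epsilon$ if 
$i \notin B$. Define the fat-shattering dimension of $\mathcal{F}$ on the domain $\mathcal{X}$ as 

\[ \mathrm{fat}_{\mathcal{F}}(\epsilon, \mathcal{X}) = \sup\{ |S| : S \subset \mathcal{X},\ S \text{ is } \epsilon\text{-shattered by } \mathcal{F} \}. \]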

fs is called the shattering function of the set B and the set {ai }" =1 is called a witness to the e-shattering. When 
the underlying space is clear, we denote it by fatjr(e). If the witness set {«i} are all equal to a constant, we call it 
as the level fat-shattering dimension, fatx-(e). 
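
As an illustration of Definition 2.5, the following Python sketch checks $\epsilon$-shattering by brute force for a small finite function class. The helper name `is_eps_shattered` and the toy class of step functions are hypothetical and are not part of the original analysis; the check enumerates all $2^n$ subsets, so it is only feasible for very small $n$.

```python
from itertools import product

def is_eps_shattered(points, functions, witness, eps):
    """Return True if `points` is eps-shattered by `functions` with the given witness.

    For every subset B (encoded as a tuple of booleans) some function f must satisfy
    f(x_i) >= witness[i] + eps for i in B and f(x_i) <= witness[i] - eps otherwise.
    """
    for pattern in product([False, True], repeat=len(points)):
        realised = any(
            all(
                (f(x) >= a + eps) if in_b else (f(x) <= a - eps)
                for x, a, in_b in zip(points, witness, pattern)
            )
            for f in functions
        )
        if not realised:
            return False
    return True

# Hypothetical toy class: increasing and decreasing step functions on the real line.
def step_up(t):
    return lambda x: 1.0 if x >= t else 0.0

def step_down(t):
    return lambda x: 0.0 if x >= t else 1.0

functions = [make(t) for t in (0.5, 1.5, 2.5) for make in (step_up, step_down)]

two_points = is_eps_shattered([1.0, 2.0], functions, [0.5, 0.5], 0.5)
three_points = is_eps_shattered([1.0, 2.0, 3.0], functions, [0.5] * 3, 0.5)
```

In this toy class two points can be 1/2-shattered with the constant witness 1/2, whereas three points cannot, because no monotone step function realises the pattern "high, low, high".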

Ref. [61] gives a relationship between the fat-shattering dimension and the pseudo-dimension.

Theorem 2.3 (Anthony and Bartlett [61]). Let $\mathcal{F}$ be a set of real-valued functions. Then:

(i) For all $\epsilon > 0$, $\mathrm{fat}_{\mathcal{F}}(\epsilon) \leq \mathrm{Pdim}(\mathcal{F})$.

(ii) If a finite set $S$ is pseudo-shattered then there is $\epsilon_0 > 0$ such that for all $\epsilon < \epsilon_0$, $S$ is $\epsilon$-shattered.

(iii) The function $\mathrm{fat}_{\mathcal{F}}(\epsilon)$ is non-increasing with $\epsilon$.

(iv) $\mathrm{Pdim}(\mathcal{F}) = \lim_{\epsilon \to 0} \mathrm{fat}_{\mathcal{F}}(\epsilon)$ (where both sides may be infinite).

Note that it is possible for the pseudo-dimension to be infinite, even when the fat-shattering dimension is finite 
for all positive e. 

In addition to the combinatorial parameters bounding the sample complexity, there are other quantities, called
covering numbers, which measure the size of the function class by a finite approximating set. The concept of
covering number dates back to Kolmogorov et al. [75] and has been used in many areas of mathematics.

Definition 2.6 (Covering Number). Let $(Y, d)$ be a metric space and let $\mathcal{F} \subseteq Y$. For every $\epsilon > 0$, the set
$\{y_1, \dots, y_n\}$ is called an $\epsilon$-cover of $\mathcal{F}$ if every $f \in \mathcal{F}$ has some $y_i$ such that $d(f, y_i) < \epsilon$. The covering number
$\mathcal{N}(\epsilon, \mathcal{F}, d)$ is the minimum cardinality of an $\epsilon$-covering set for $\mathcal{F}$ with respect to the metric $d$.

To characterise the size of the function class $\mathcal{F}$ in machine learning, we investigate the metrics endowed by the
samples; for every sample $\{x_1, \dots, x_n\} \subset \mathcal{X}$, let $\mu_n = n^{-1} \sum_{i=1}^n \delta_{x_i}$ be the empirical measure supported on that
sample. For $1 \leq p < \infty$ and a function $f$, denote $\|f\|_{L_p(\mu_n)} = \big( n^{-1} \sum_{i=1}^n |f(x_i)|^p \big)^{1/p}$ and $\|f\|_\infty = \max_{1 \leq i \leq n} |f(x_i)|$.
Then $\mathcal{N}(\epsilon, \mathcal{F}, L_p(\mu_n))$ is the covering number of $\mathcal{F}$ at scale $\epsilon$ with respect to the $L_p(\mu_n)$ norm.
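
The empirical norm and the covering number can be made concrete with a short Python sketch. It assumes a finite function class given by a matrix of values (one row per function, one column per sample point); the helper names are hypothetical. The greedy procedure only yields an upper bound on the covering number, since it restricts the centres to members of the class.

```python
import numpy as np

def empirical_lp_norm(values, p):
    """L_p(mu_n) norm of a function, given its values f(x_1), ..., f(x_n) on the sample."""
    values = np.abs(np.asarray(values, dtype=float))
    if np.isinf(p):
        return float(values.max())
    return float(np.mean(values ** p) ** (1.0 / p))

def greedy_cover_size(value_matrix, eps, p=2):
    """Size of a greedily built eps-cover of a finite class (rows are functions).

    Each chosen centre removes every function at empirical distance < eps from it,
    so the returned size upper-bounds the covering number at scale eps.
    """
    value_matrix = np.asarray(value_matrix, dtype=float)
    remaining = list(range(len(value_matrix)))
    centres = []
    while remaining:
        c = remaining[0]
        centres.append(c)
        remaining = [
            j for j in remaining
            if empirical_lp_norm(value_matrix[j] - value_matrix[c], p) >= eps
        ]
    return len(centres)

values = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0]])
cover_size = greedy_cover_size(values, eps=0.5)
```

Here the first two functions lie within distance 0.1 of each other and share a centre, while the third needs its own.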

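Definition 2.7 (Entropy Number). For every class $\mathcal{F}$, $1 \leq p \leq \infty$ and $\epsilon > 0$, let

\[ \mathcal{N}_p(\epsilon, \mathcal{F}, n) = \sup_{\mu_n} \mathcal{N}(\epsilon, \mathcal{F}, L_p(\mu_n)), \]

and

\[ \mathcal{N}_p(\epsilon, \mathcal{F}) = \sup_n \sup_{\mu_n} \mathcal{N}(\epsilon, \mathcal{F}, L_p(\mu_n)). \]

We call $\log \mathcal{N}_p(\epsilon, \mathcal{F}, n)$ the entropy number of $\mathcal{F}$ with respect to $L_p(\mu_n)$ and $\log \mathcal{N}_p(\epsilon, \mathcal{F})$ the uniform entropy
number.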


The significance of the uniform measures of complexity (i.e. the uniform entropy number and the combinatorial parameters)
lies in the fact that they can characterise the uGC class. However, the resulting bounds are loose. Bartlett and Mendelson
[67] applied concentration-of-measure techniques for empirical processes and proposed a random average
quantity, the Rademacher complexity, which captures the size of the uGC class more directly and leads to sharper
complexity bounds.


Definition 2.8 (Rademacher Complexity$^{7}$ [67, 74, 76]). Let $\mu$ be a probability measure on $\mathcal{X}$ and $\mathcal{F}$ be a set of
uniformly bounded functions on $\mathcal{X}$. For every positive integer $n$, define

\[ \mathcal{R}_n(\mathcal{F}) = \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{\sqrt{n}} \Big| \sum_{i=1}^n \gamma_i f(x_i) \Big| , \]

where $\{x_i\}_{i=1}^n$ are independent random variables distributed according to $\mu$ and each $\gamma_i$ independently takes values
in $\{-1, +1\}$ with equal probability (the $\gamma_i$ are also independent of $\{x_i\}_{i=1}^n$). The quantity $\mathcal{R}_n(\mathcal{F})$ is called the
Rademacher complexity associated with the class $\mathcal{F}$.
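
For intuition, the quantity in Definition 2.8 can be estimated by Monte Carlo for a finite class on a fixed sample. The sketch below is an illustration under these assumptions (a finite class, a fixed sample, and the $1/\sqrt{n}$ normalisation used here); it averages only over the Rademacher signs, and the function name is hypothetical.

```python
import numpy as np

def rademacher_complexity(value_matrix, trials=2000, seed=0):
    """Monte Carlo estimate of E sup_f |sum_i gamma_i f(x_i)| / sqrt(n) on a fixed sample.

    value_matrix has one row per function and one column per sample point.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(value_matrix, dtype=float)
    n = values.shape[1]
    gammas = rng.choice([-1.0, 1.0], size=(trials, n))   # one sign vector per trial
    sups = np.abs(gammas @ values.T).max(axis=1)          # sup over the class, per trial
    return float(sups.mean() / np.sqrt(n))

small_class = [[1.0, 0.0]]
larger_class = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
r_small = rademacher_complexity(small_class)
r_large = rademacher_complexity(larger_class)
```

Because both calls use the same seed, the larger class is evaluated on the same sign vectors, so its estimate can only be at least as large as that of the smaller class.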




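We remark that the complexity measures are related to one another [77-79]:

\[ \mathrm{fat}_{\mathcal{F}}(\epsilon) \lesssim \log \mathcal{N}_2(\epsilon, \mathcal{F}, n) \lesssim \frac{\mathcal{R}_n^2(\mathcal{F})}{\epsilon^2} \lesssim \mathrm{fat}_{\mathcal{F}}(\epsilon) \cdot \log(\,\cdot\,). \]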

To sum up the results we have presented so far, the complexity measures, such as the combinatorial parameters
(e.g. VC dimension and fat-shattering dimension), covering numbers and the Rademacher complexity of the
hypothesis set control the rate of uniform convergence. By computing those quantities of the given hypothesis set
and according to Eqs. (B.1), (B.2), (B.3) and (B.4), we can estimate the bounds on the sample complexity of the
learning problems.

3. The Framework for Learning Quantum Measurements and Quantum States 

In this section, we unify the two quantum learning problems at hand into learning linear functionals. In Section 
3.3, we justify the proposed quantum learning model in practical situations. 

3.1. Quantum Learning Problems as Linear Functional on Matrices. Recall that a physical theory aims to
predict events observed in experiments by describing three types of apparatus: preparation, transformation, and
measurement. The preparation process of a system can be embodied by a state, while an effect is a measurement
that produces either a ‘yes’ or a ‘no’ outcome. However, owing to the statistical nature of Quantum Theory, only
the probabilities of these outcomes can be predicted (by counting over multiple measurements). More precisely,
assume that a system is prepared in the state $\rho \in \mathcal{Q}(\mathcal{H})$. Then the outcome of
every two-outcome measurement $E \in \mathcal{E}(\mathcal{H})$ follows the probability distribution

\[ f_E(\rho) = \mathrm{Tr}(E\rho) = \langle E, \rho \rangle \in [0,1]. \]

Note that it is a linear functional on the state space, i.e. $f_E : \mathcal{Q}(\mathcal{H}) \to \mathbb{R}$. In the theory of learning, such $[0,1]$-valued
functions are called probabilistic concepts [74].

The following proposition establishes the one-to-one correspondence between $f_E$ and $E$.

Proposition 3.1 (The Correspondence between Two-Outcome Measurement and Linear Functional). [80, Prop. 2.30]
Given a Hilbert space $\mathcal{H}$, let $f_E$ be an effect, i.e. a linear map from $\mathcal{Q}(\mathcal{H})$ to the interval $[0,1]$. Then there exists a
bounded operator $E \in \mathcal{E}(\mathcal{H})$ such that

\[ f_E(\rho) = \mathrm{Tr}(E\rho) = \langle E, \rho \rangle \quad \forall \rho \in \mathcal{Q}(\mathcal{H}). \]

Furthermore, the operator $E$ is unique in the following sense. Let $E_1, E_2 \in \mathcal{E}(\mathcal{H})$. If $\langle \psi | E_1 | \psi \rangle = \langle \psi | E_2 | \psi \rangle$ for every
$|\psi\rangle \in \mathcal{H}$, then $E_1 = E_2$.

The proposition states that every two-outcome measurement can be identified with a linear functional on the state
space. Consequently, the problem of learning an unknown (two-outcome) quantum measurement is equivalent to
learning a real-valued linear functional on quantum states. Here and subsequently, we use the term effect to refer to
either the linear functional on $\mathcal{Q}(\mathcal{H})$ or the two-outcome measurement $E \in \mathcal{E}(\mathcal{H})$.
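
The identification of an effect with the functional $\rho \mapsto \mathrm{Tr}(E\rho)$ can be checked numerically. The following Python sketch samples a random state and effect (the sampling scheme and helper names are assumptions made for illustration) and evaluates the functional, which takes values in $[0,1]$ and is linear under mixtures of states.

```python
import numpy as np

def random_density_matrix(d, rng):
    """Sample a density matrix (positive semidefinite, unit trace) from a Ginibre matrix."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def random_effect(d, rng):
    """Sample an effect 0 <= E <= I by rescaling a positive semidefinite matrix."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    e = g @ g.conj().T
    return e / np.linalg.eigvalsh(e).max()

def f_E(effect, rho):
    """The linear functional rho -> Tr(E rho) associated with the effect E."""
    return float(np.trace(effect @ rho).real)

rng = np.random.default_rng(0)
rho1 = random_density_matrix(3, rng)
rho2 = random_density_matrix(3, rng)
E = random_effect(3, rng)

mixture_value = f_E(E, 0.3 * rho1 + 0.7 * rho2)
combined_value = 0.3 * f_E(E, rho1) + 0.7 * f_E(E, rho2)
```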

Conversely, if the measurement apparatus is chosen as some $E \in \mathcal{E}(\mathcal{H})$, then the measurement outcome of every
state $\rho$ is distributed as

\[ f_\rho(E) = \mathrm{Tr}(E\rho) = \langle E, \rho \rangle \in [0,1] \quad \forall \rho \in \mathcal{Q}(\mathcal{H}). \]

Therefore, we take the state space as the set of linear functionals on the effect space by the following proposition:

Proposition 3.2 (The Correspondence between Quantum State and Linear Functional on Effect Space). [81] Given
a Hilbert space $\mathcal{H}$, let $f_\rho$ be a probability measure on $\mathcal{E}(\mathcal{H})$. Then there exists a quantum state $\rho \in \mathcal{Q}(\mathcal{H})$ such that

\[ f_\rho(E) = \mathrm{Tr}(E\rho) = \langle E, \rho \rangle \quad \forall E \in \mathcal{E}(\mathcal{H}). \]

Furthermore, different $\rho_1, \rho_2 \in \mathcal{Q}(\mathcal{H})$ determine different probability measures, i.e. there exists an operator $E \in
\mathcal{E}(\mathcal{H})$ such that $\mathrm{Tr}(E\rho_1) \neq \mathrm{Tr}(E\rho_2)$.

Similarly, according to the one-to-one correspondence between $\rho$ and $f_\rho$, learning an unknown quantum state
coincides with learning a real-valued linear functional on the effect space.


7 Some authors define the Rademacher complexity with the normalisation term as $n$ rather than $\sqrt{n}$. Here we follow the notation used
in Ref. [76], which is more convenient to bound the sample complexity (e.g. Eq. (B.4)).
3.2. Learning Linear Functionals on Banach Space. In the previous section, we established the relationship
between quantum measurements/states and linear functionals on matrices. By the duality theorem (see Theorem
3.1 below), the two quantum learning problems can be unified into the problem of learning the membership in a
Banach space. Furthermore, the real-valued function that associates with the target quantity in the Banach space
is isomorphic to the linear functional on the input space, i.e. an element in the dual space of the input space. For
example, assume the input space is the unit ball of the Schatten $p$-class, i.e. $\mathcal{X} = S_p^d$. Then the hypothesis set can
be represented as the linear functionals that are polar$^{8}$ to $S_p^d$, i.e. for all $x \in S_p^d$ and $1/p + 1/q = 1$,

\[ \mathcal{F} = \{ x \mapsto \langle E, x \rangle : E \in S_q^d \} = (S_p^d)^\circ. \]

Under this duality formalism, the problem of estimating the complexity measures of a subset of a Banach space
can be transformed into the following question: is a set of linear functionals agnostic PAC learnable?


Theorem 3.1 (Duality of Bounded Operator and Trace class). [82, Theorems 19.1 and 19.2] Fix a Hilbert space
$\mathcal{H}$. The map $E \mapsto f_E$ is an isometric isomorphism from the space of bounded operators, $\mathcal{B}(\mathcal{H})$, to the dual space of
the set of trace-class operators, $\mathcal{T}(\mathcal{H})^*$. Conversely, the map $\rho \mapsto f_\rho$ is an isometric isomorphism from $\mathcal{T}(\mathcal{H})$ to
$\mathcal{B}(\mathcal{H})^*$.

Mendelson and Schechtman [83] first investigated the fat-shattering dimension of sets of linear functionals on
Banach spaces and proposed the following useful result.


Lemma 3.1 (Mendelson and Schechtman [83]). The set $S = \{x_1, \dots, x_n\} \subseteq B_X$ is $\epsilon$-shattered by $B_{X^*}$ if and only
if $\{x_i\}_{i=1}^n$ are linearly independent and for every $a_1, \dots, a_n \in \mathbb{R}$,

\[ \epsilon \sum_{i=1}^n |a_i| \leq \Big\| \sum_{i=1}^n a_i x_i \Big\|_X , \]

where $B_X$ is the unit ball of some Banach space $X$ and $B_{X^*}$ is its dual unit ball.


By restricting the values of the set $\{a_i\}_{i=1}^n$ to $\{+1, -1\}$, the core idea of Lemma 3.1 is to calculate the Rademacher
series on the Banach space, where the $n$-point Rademacher series on $X$ is defined as $\sum_{i=1}^n \gamma_i x_i$ and $\{\gamma_i\}_{i=1}^n$ are
symmetric $\{+1, -1\}$-valued random variables. Additionally, with the following duality formula for the Schatten
$p$-norm, we can estimate the range of the linear functional, which will be helpful in deriving the complexity
measures.


Theorem 3.2 (Duality Formula for $\|A\|_p$). [84, Theorem 7.1] For all $p \geq 1$, define $q$ by $1/q + 1/p = 1$. Then for
all $A \in \mathbb{M}_d$,

\[ \|A\|_p = \sup_{B \in \mathbb{M}_d} \{ \mathrm{Tr}(BA) : \|B\|_q = 1 \}. \]
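
The duality formula can be verified numerically in the case $p = 1$, $q = \infty$ that is used for learning quantum measurements. The sketch below is an illustration restricted to Hermitian $A$ (an assumption made for simplicity): the supremum is attained by $B = \mathrm{sign}(A)$, whose spectral norm is 1. The helper names are hypothetical.

```python
import numpy as np

def trace_norm(a):
    """Schatten 1-norm of a Hermitian matrix: the sum of its absolute eigenvalues."""
    return float(np.abs(np.linalg.eigvalsh(a)).sum())

def optimal_dual_operator(a):
    """B = sign(A) for Hermitian A; it satisfies ||B||_inf <= 1 and Tr(BA) = ||A||_1."""
    w, v = np.linalg.eigh(a)
    return (v * np.sign(w)) @ v.conj().T   # scales each eigenvector column by sign(w_j)

rng = np.random.default_rng(1)
g = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
A = (g + g.conj().T) / 2                   # random Hermitian matrix
B = optimal_dual_operator(A)

attained = float(np.trace(B @ A).real)     # value of Tr(BA) at the maximiser
identity_value = float(np.trace(A).real)   # Tr(BA) for the feasible choice B = I
```

Any other feasible $B$, such as the identity, gives a value no larger than the trace norm.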

The techniques from Mendelson and Schechtman (Lemma 3.1) and the duality formula (Theorem 3.2) can be 
used to upper bound the fat-shattering dimension and the Rademacher complexity via the Rademacher series. What 
remains is to compute the Rademacher series on the Banach space for both complexity measures, and we leave the 
details to Sections 4 and 5. 


3.3. The Justification of the Quantum Learning Model. Before proceeding to derive the complexity measures,
we first address two practical issues that may arise in our quantum learning setting: (1) Only the ‘yes’
(‘1’) or ‘no’ (‘0’) outcome can be observed rather than the outcome statistics$^{9}$. (2) The measurement apparatus is
not perfect (e.g. there are measurement errors in the training data set). However, we will show that the sample
complexities of the two scenarios remain the same (up to a Lipschitz constant).


8 In convex analysis, a convex body $K \subset \mathbb{R}^n$ is a convex compact set with nonempty interior. The gauge of a convex body $K$, also known
as the Minkowski functional, is defined by $\|x\|_K := \inf\{t > 0 : x \in tK\}$. If $K$ is symmetric with respect to the origin ($-K = K$), then
$K$ is a unit ball associated with the norm $\|\cdot\|_K$ and the inner product $\langle \cdot, \cdot \rangle$. We define the polar of $K$ as

\[ K^\circ = \Big\{ x \in \mathbb{R}^n : \sup_{k \in K} \langle k, x \rangle \leq 1 \Big\}. \]

In the symmetric case, $K^\circ$ is the unit ball of the dual space of $(\mathbb{R}^n, \|\cdot\|_K)$. Here, $S_1^d$ is a unit ball of Schatten 1-class and $S_\infty^d$ is a unit
ball of Schatten $\infty$-class. Considering the Hilbert-Schmidt inner product, $S_1^d$ and $S_\infty^d$ are polar to each other.

9 The situation can also occur when only one measurement is performed.

The output space consists of binary measurement outcomes rather than measurement statistics. In
this case, the training sample $(X_i, Y_i)$ equals $(X_i, 1)$ with probability $\mathrm{Tr}(\Pi X_i)$, and $(X_i, 0)$ with probability
$1 - \mathrm{Tr}(\Pi X_i)$. We show that the covering number remains the same as for the training sample $(X_i, \mathrm{Tr}(\Pi X_i))$ considered
in the quantum machine learning setting. Other complexity measures easily follow by the same argument. Assume
the underlying loss function $\ell_f$ satisfies the Lipschitz condition, i.e. there exists $L > 0$ such that

\[ |\ell_f(X, Y) - \ell_g(X, Y)| \leq L \, |f(X) - g(X)|. \tag{3.1} \]

Denoting $p_X = \mathrm{Tr}(\Pi X)$, the expected risk can be expressed as follows:

\[ L(f) = \mathbb{E}_\mu \ell_f(X, Y) \]
\[ \phantom{L(f)} = \mathbb{E}_X \mathbb{E}_{Y|X} \ell_f(X, Y) \]
\[ \phantom{L(f)} = \mathbb{E}_X \big[ p_X \ell_f(X, 1) + (1 - p_X) \ell_f(X, 0) \big] \]
\[ \phantom{L(f)} =: \mathbb{E}_X \ell'_f(X, Y). \]

In the third equality we use the fact that the ‘1’ (resp. ‘0’) outcome occurs with probability $p_X = \mathrm{Tr}(\Pi X)$ (resp. $1 - p_X$).
In the last line we introduce the induced loss function $\ell'_f(X, Y) := p_X \ell_f(X, 1) + (1 - p_X) \ell_f(X, 0)$. Then for
all $X \in \mathcal{X}$, the distance between $\ell'_f$ and $\ell'_g$ can be calculated as

\[ |\ell'_f(X, Y) - \ell'_g(X, Y)| = \big| p_X (\ell_f(X, 1) - \ell_g(X, 1)) + (1 - p_X) (\ell_f(X, 0) - \ell_g(X, 0)) \big| \]
\[ \leq p_X |\ell_f(X, 1) - \ell_g(X, 1)| + (1 - p_X) |\ell_f(X, 0) - \ell_g(X, 0)| \]
\[ \leq p_X \cdot L \, |f(X) - g(X)| + (1 - p_X) \cdot L \, |f(X) - g(X)| \]
\[ = L \, |f(X) - g(X)|. \]

The first inequality follows from the triangle inequality, and the second is due to the Lipschitz condition. The
above relation shows that the distance $|\ell'_f - \ell'_g|$ can be upper bounded by $L|f - g|$, which is exactly the same as
the upper bound for $|\ell_f - \ell_g|$ (see Eq. (3.1)). Recalling Definition 2.6, it is clear that the covering numbers with
respect to the induced loss function and the original loss function are bounded by the same quantity. Therefore,
the generalisation error, Eq. (2.3), and the sample complexity do not change in this scenario.
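
The Lipschitz argument above can be checked numerically for a concrete loss. The following Python sketch assumes the absolute loss $\ell_f(X, Y) = |f(X) - Y|$, which is 1-Lipschitz in the prediction; this choice of loss and the helper names are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def loss(prediction, label):
    """Absolute loss, which is 1-Lipschitz in the prediction (L = 1)."""
    return abs(prediction - label)

def induced_loss(prediction, p_x):
    """Expected loss when the label is 1 with probability p_x and 0 otherwise."""
    return p_x * loss(prediction, 1.0) + (1.0 - p_x) * loss(prediction, 0.0)

rng = np.random.default_rng(0)
# Random predictions f(X), g(X) and outcome probabilities p_X, all in [0, 1].
f_vals, g_vals, p_vals = rng.uniform(0.0, 1.0, size=(3, 1000))

gaps = np.abs(induced_loss(f_vals, p_vals) - induced_loss(g_vals, p_vals))
lipschitz_bound = 1.0 * np.abs(f_vals - g_vals)
```

Every entry of `gaps` stays below the corresponding entry of `lipschitz_bound`, in line with the chain of inequalities above.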

There is noise involved in the measurement procedure. In this case, we assume that the training sample
is $(X, Y + \eta)$, where $Y = \mathrm{Tr}(\Pi X)$ and $\eta$ is a random variable that models the measurement error. Following the
same reasoning, we can calculate the expected risk as follows:

\[ L(f) = \mathbb{E}_\mu \ell_f(X, Y + \eta) \]
\[ \phantom{L(f)} = \mathbb{E}_X \mathbb{E}_\eta \ell_f(X, Y + \eta) \]
\[ \phantom{L(f)} =: \mathbb{E}_X \ell'_f(X, Y). \]

In the last line, we let $\ell'_f(X, Y) := \mathbb{E}_\eta \ell_f(X, Y + \eta)$. Thus,

\[ |\ell'_f(X, Y) - \ell'_g(X, Y)| = |\mathbb{E}_\eta \ell_f(X, Y + \eta) - \mathbb{E}_\eta \ell_g(X, Y + \eta)| \]
\[ \leq L \, |\mathbb{E}_\eta [f(X) - g(X)]| \]
\[ = L \, |f(X) - g(X)|. \]

Therefore, the complexity measures (which depend on the distance between the loss functions) and the induced
sample complexity remain the same.


4. Learning Quantum Measurements 

In this section, we follow the quantum learning framework presented in Section 3 and explicitly show how to 
derive the upper bound for the fat-shattering dimension, Rademacher complexity and the covering/entropy number. 
We then discuss how these complexity measures relate to quantum state discrimination. 

Recall that, in the problem of learning an unknown quantum measurement, the goal is to learn a fixed but
unknown effect $\Pi \in \mathcal{E}(\mathbb{C}^d)$ through the training data set $Z_n = \{(\rho_i, \mathrm{Tr}(\Pi \rho_i))\}_{i=1}^n$, where $\{\rho_i\}_{i=1}^n \subset \mathcal{Q}(\mathbb{C}^d) = \mathcal{X}$
are distributed independently according to the unknown measure $\mu$. Note that learning $\Pi$ is equivalent to learning a
two-outcome POVM $\{\Pi, I - \Pi\}$. Due to the correspondence between a quantum effect $E \in \mathcal{E}(\mathbb{C}^d)$ and the linear
functional $f_E : \rho \mapsto \langle E, \rho \rangle$ on the input space $\mathcal{X}$ (Proposition 3.1), we consider the hypothesis set that consists of
all quantum effects$^{10}$; that is,

\[ \mathcal{F} = \{ f_E : E \in \mathcal{E}(\mathbb{C}^d) \}. \]


10 The hypothesis set can be chosen as a subset of the effect space, to which the target effect $\Pi$ may not belong. Then the goal is to
choose an effect in the hypothesis set that approximates the target well. We discuss this issue in Section 6. Also note that we sometimes
denote $\mathcal{F}$ as the subset of $\mathcal{E}(\mathbb{C}^d)$ and sometimes denote it as the linear functionals formed by that subset.
In the following, we present our main result on the question: “how many quantum states are needed to learn a
quantum measurement?” This is exactly the sample complexity problem introduced in Section 2.1. To tackle this
problem, we have to estimate the complexity measures that characterise the size of the hypothesis set.


4.1. The Fat-Shattering Dimension for Learning Quantum Measurements. Our first step is to use a
common trick in convex analysis, namely “symmetrisation” of the state space and the effect space, to embed them
into a subset of the Banach space. In other words, the symmetric convex hull of the state space forms a unit ball
of Schatten 1-class:

\[ S_1^d := \mathrm{conv}(-\mathcal{Q}(\mathbb{C}^d) \cup \mathcal{Q}(\mathbb{C}^d)), \]

where $\mathrm{conv}(\cdot)$ denotes the convex hull operation. Similarly, we have

\[ S_\infty^d := \mathrm{conv}(-\mathcal{E}(\mathbb{C}^d) \cup \mathcal{E}(\mathbb{C}^d)). \]

Now the input space $\mathcal{X} \subseteq S_1^d$ and the hypothesis set $\mathcal{F}$ consists of linear functionals which can be parameterised by
the elements in $S_\infty^d$. That is,

\[ \mathcal{F} = \{ f_E : E \in S_\infty^d \}. \]

The main reason for introducing $S_1^d$ and $S_\infty^d$ is that they are unit balls which are polar to each other (through the
Hilbert-Schmidt inner product). Thus, we can apply Mendelson and Schechtman’s result (Lemma 3.1) to estimate
the fat-shattering dimension.

The following is our main result in this section.


Theorem 4.1 (Fat-Shattering Dimension for Learning Quantum Measurements). For all $0 < \epsilon < 1/2$ and integer
$d \geq 2$, we have

\[ \mathrm{Pdim}(\mathcal{E}(\mathbb{C}^d)) \leq d^2, \]

and

\[ \mathrm{fat}_{\mathcal{E}(\mathbb{C}^d)}(\epsilon, \mathcal{Q}(\mathbb{C}^d)) = \min\{ O(d/\epsilon^2), d^2 \}. \]


Proof. We first present the outline of the proof. According to the definition of the fat-shattering dimension, it follows
that the function $\mathrm{fat}_{\mathcal{F}}(\epsilon)$ is non-increasing with $\epsilon$. Hence, our first objective is to check whether the fat-shattering
dimension is unbounded. Equivalently, it suffices to find the pseudo dimension, which bounds the fat-shattering
dimension (Theorem 2.3). Secondly, assume there is a set of $n$ points that can be $\epsilon$-shattered; we will find an
inequality to relate $n$ with $\epsilon$, which will prove our claim.

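(i) Pseudo Dimension: Since the space of self-adjoint matrices $\mathbb{M}_d^{\mathrm{sa}}$ is a real vector space with dimension $d^2$ and $S_\infty^d$ is a subset of $\mathbb{M}_d^{\mathrm{sa}}$, we can embed
$S_\infty^d$ into a real vector space of dimension $d^2$. Since the function class $\mathcal{F}$ is a subset of a $d^2$-dimensional vector space,
by Theorem 2.2 we obtain $\mathrm{Pdim}(\mathcal{F}) \leq d^2$.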

(ii) Fat-Shattering Dimension: Consider any set $S = \{x_1, \dots, x_n\} \subseteq S_1^d$ that is $\epsilon$-shattered by $S_\infty^d$, where $n \leq d^2$.
Denote a Rademacher series as $\sum_{i=1}^n \gamma_i x_i$, where $\{\gamma_i\}_{i=1}^n$ are independent and uniform $\{+1, -1\}$ random variables
(also called Rademacher random variables). By selecting $a_i = \gamma_i$ in Lemma 3.1, we have

\[ \epsilon n \leq \Big\| \sum_{i=1}^n \gamma_i x_i \Big\|_1 . \tag{4.1} \]

We adopt a probabilistic method to upper bound the right-hand side of Eq. (4.1). If we can find a quantity $C(n, d)$
that upper bounds $\mathbb{E} \|\sum_{i=1}^n \gamma_i x_i\|_1$, then there is a realization of $\{\gamma_i\}_{i=1}^n$ such that $\|\sum_{i=1}^n \gamma_i x_i\|_1 \leq C(n, d)$. As a
result, it remains to find an upper bound for the expected norm of the Rademacher series $\mathbb{E} \|\sum_{i=1}^n \gamma_i x_i\|_1$.

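In order to upper bound the Rademacher series, we need the powerful Noncommutative Khintchine Inequalities
[58]:


Proposition 4.1 (Noncommutative Khintchine Inequalities [58, 85]). Let $\{x_i\}_{i=1}^n$ be deterministic $d \times d$ matrices and
$\{\gamma_i\}_{i=1}^n$ be independent Rademacher random variables. Then

\[ \mathbb{E} \Big\| \sum_{i=1}^n \gamma_i x_i \Big\|_p \simeq_p \begin{cases} \Big( \big\| (\sum_{i=1}^n x_i x_i^\dagger)^{1/2} \big\|_p^p + \big\| (\sum_{i=1}^n x_i^\dagger x_i)^{1/2} \big\|_p^p \Big)^{1/p}, & \text{if } 2 \leq p < \infty, \\ \inf_{x_i = a_i + b_i} \Big( \big\| (\sum_{i=1}^n a_i a_i^\dagger)^{1/2} \big\|_p^p + \big\| (\sum_{i=1}^n b_i^\dagger b_i)^{1/2} \big\|_p^p \Big)^{1/p}, & \text{if } 1 \leq p \leq 2, \end{cases} \]

where $\simeq_p$ means that the equality holds up to an absolute constant depending on $p$, and $\dagger$ denotes the complex
conjugate operation.

Note that Haagerup and Musat [85] proved that the result also holds when $\{\gamma_i\}_{i=1}^n$ are independent standard complex
Gaussian random variables.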













Invoking Proposition 4.1, we have

\[ \mathbb{E} \Big\| \sum_{i=1}^n \gamma_i x_i \Big\|_1 \leq \Big\| \Big( \sum_{i=1}^n x_i^2 \Big)^{1/2} \Big\|_1 . \]

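Since the square operation preserves $S_1^d$, i.e. $x_i^2 \in S_1^d$ for all $x_i \in S_1^d$, by the convexity of $S_1^d$ we have $\frac{1}{n} \sum_{i=1}^n x_i^2 \in S_1^d$.
Then the problem is reduced to finding

\[ \max_{\{x_i\} \subset S_1^d} \sqrt{n} \, \Big\| \Big( \frac{1}{n} \sum_{i=1}^n x_i^2 \Big)^{1/2} \Big\|_1 , \]

which is essentially a convex optimisation problem

\[ \max_{x \in S_1^d} \sqrt{n} \, \| \sqrt{x} \|_1 = \max_{x \in S_1^d} \sqrt{n} \sum_{j=1}^d \sqrt{|\lambda_j|}, \quad \text{subject to } \sum_{j=1}^d |\lambda_j| = 1 , \]

where $\lambda_1, \dots, \lambda_d$ are the eigenvalues of $x$.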

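Since the square root is concave, we attain the maximum when $|\lambda_j| = 1/d$ for $j = 1, \dots, d$. That is,

\[ \mathbb{E} \Big\| \sum_{i=1}^n \gamma_i x_i \Big\|_1 \leq \max_{x \in S_1^d} \sqrt{n} \sum_{j=1}^d \sqrt{|\lambda_j|} = \sqrt{nd}. \tag{4.2} \]

Consequently, there is a realization of $\{\gamma_i\}_{i=1}^n$ such that $\|\sum_{i=1}^n \gamma_i x_i\|_1 \leq \sqrt{nd}$, $\forall x_i \in S_1^d$. Combined with Eq. (4.1),
we have $\epsilon n \leq \sqrt{nd}$, i.e. $n \leq d/\epsilon^2$, which proves our claim.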

□ 
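
As a numerical sanity check of the bound in Eq. (4.2), the following Python sketch estimates the expected trace norm of a Rademacher series by Monte Carlo and compares it with $\sqrt{nd}$. The sampling of random states and the helper names are assumptions made for illustration; the sketch is not part of the proof.

```python
import numpy as np

def trace_norm(a):
    """Schatten 1-norm of a Hermitian matrix: the sum of its absolute eigenvalues."""
    return float(np.abs(np.linalg.eigvalsh(a)).sum())

def random_density_matrix(d, rng):
    """Sample a density matrix from a complex Ginibre matrix."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def mean_rademacher_trace_norm(states, trials=500, seed=1):
    """Monte Carlo estimate of E || sum_i gamma_i x_i ||_1 over random signs gamma_i."""
    rng = np.random.default_rng(seed)
    states = np.asarray(states)
    total = 0.0
    for _ in range(trials):
        gammas = rng.choice([-1.0, 1.0], size=len(states))
        total += trace_norm(np.tensordot(gammas, states, axes=1))
    return total / trials

d, n = 4, 16
rng = np.random.default_rng(0)
random_states = [random_density_matrix(d, rng) for _ in range(n)]
basis_states = [np.diag(e) for e in np.eye(d)]   # d mutually orthogonal pure states

random_estimate = mean_rademacher_trace_norm(random_states)
basis_estimate = mean_rademacher_trace_norm(basis_states)
```

For the $d$ orthogonal pure states the series has trace norm exactly $d = \sqrt{d \cdot d}$ for every choice of signs, so the bound is attained when $n = d$.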


In the following proposition, we will demonstrate that the upper bound is tight.

Proposition 4.2. Considering a Hilbert space $\mathbb{C}^d$, there exist infinitely many sets of $d$ quantum states that can be
$1/2$-shattered by the effect space.

Proof. Consider arbitrary $d$ mutually orthogonal rank-1 projection operators (pure states) $\{\rho_i\}_{i=1}^d$ on $\mathbb{C}^d$ as the
input states. Now for every $B \subseteq \{1, \dots, d\}$, denote $f_B : \rho \mapsto \langle \sum_{i \in B} \rho_i, \rho \rangle$ for $\rho \in \mathcal{Q}(\mathbb{C}^d)$. Note that one can
easily check $\sum_{i \in B} \rho_i \in \mathcal{E}(\mathbb{C}^d)$. Then for $i \in B$, we have

\[ f_B(\rho_i) = \Big\langle \sum_{j \in B} \rho_j, \rho_i \Big\rangle = \langle \rho_i, \rho_i \rangle = 1. \]

Similarly, $f_B(\rho_i) = 0$ if $i \notin B$. As a result, $\{\rho_i\}_{i=1}^d$ is $1/2$-shattered by $\{f_B\}$. □
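
The construction in the proof of Proposition 4.2 can be checked directly. The Python sketch below takes the computational basis states in dimension $d = 3$ (one particular choice of orthogonal pure states, picked for illustration) and verifies that every subset pattern is realised with witness $1/2$ and margin $1/2$. The helper names are hypothetical.

```python
import numpy as np
from itertools import product

d = 3
basis_states = [np.outer(e, e) for e in np.eye(d)]   # rho_i = |i><i|

def effect_for_subset(pattern):
    """E_B = sum of rho_i over i in B; a projector, hence a valid effect."""
    selected = (rho for rho, in_b in zip(basis_states, pattern) if in_b)
    return sum(selected, np.zeros((d, d)))

def is_half_shattered():
    """Check f_B(rho_i) >= 1/2 + 1/2 for i in B and f_B(rho_i) <= 1/2 - 1/2 otherwise."""
    for pattern in product([False, True], repeat=d):
        e_b = effect_for_subset(pattern)
        values = [np.trace(e_b @ rho) for rho in basis_states]
        violated = any(
            (v < 1.0) if in_b else (v > 0.0)
            for v, in_b in zip(values, pattern)
        )
        if violated:
            return False
    return True

shattered = is_half_shattered()
```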

4.2. The Rademacher Complexity. Following the paradigm in Section 4.1, we calculate the Rademacher complexity
of the effect space $\mathcal{E}(\mathbb{C}^d)$ via the duality formula, Theorem 3.2, and the noncommutative Khintchine
inequality, Proposition 4.1.

Theorem 4.2 (Rademacher Complexity for Learning Quantum Measurements). Assume the input space is the
state space $\mathcal{X} = \mathcal{Q}(\mathbb{C}^d)$ and the hypothesis set $\mathcal{F} = \{ f_E : E \in \mathcal{E}(\mathbb{C}^d) \}$. Then the Rademacher complexity is

\[ \mathcal{R}_n(\mathcal{E}(\mathbb{C}^d)) = O(\sqrt{d}). \]
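Proof. Recall the definition of the Rademacher complexity (Definition 2.8). We have

\[ \sqrt{n} \, \mathcal{R}_n(S_\infty^d) = \mathbb{E} \sup_{E \in S_\infty^d} \Big| \sum_{i=1}^n \gamma_i f_E(x_i) \Big| \]
\[ = \mathbb{E} \sup_{E \in S_\infty^d} \Big| \sum_{i=1}^n \gamma_i \langle E, x_i \rangle \Big| \]
\[ = \mathbb{E} \sup_{E \in S_\infty^d} \Big| \Big\langle E, \sum_{i=1}^n \gamma_i x_i \Big\rangle \Big| \]
\[ \leq \mathbb{E} \Big\| \sum_{i=1}^n \gamma_i x_i \Big\|_1 \]
\[ \leq \sqrt{nd}. \]

The fourth line is due to the duality formula (Theorem 3.2), and the last relation follows from Eq. (4.2). This
completes the proof. □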


4.3. The Entropy Number. The covering number (and the related entropy number) follows directly from the
Rademacher complexity by Sudakov’s minoration theorem.


Theorem 4.3 (Sudakov’s Minoration Theorem). [78, 86, 87] Let $T$ be an index set. Let $X = (X_t)_{t \in T}$ be a
sub-Gaussian process$^{11}$ with $L_2$-metric $d_X$ (i.e. $d_X(s, t) = \|X_s - X_t\|_2$ for $s, t \in T$). Then for each $\epsilon > 0$,

\[ \epsilon \, (\log \mathcal{N}(\epsilon, T, d_X))^{1/2} \leq C \, \mathbb{E} \sup_{t \in T} \|X_t\|_1 , \]

for some constant $C$.


Corollary 4.1 (Entropy Number for Learning Quantum Measurements). Assume the input space is the state
space $\mathcal{X} = \mathcal{Q}(\mathbb{C}^d)$ and the hypothesis set $\mathcal{F} = \{ f_E : E \in \mathcal{E}(\mathbb{C}^d) \}$. Then for each $\epsilon > 0$, the covering number of the
function class is

\[ \log \mathcal{N}_2(\epsilon, \mathcal{E}(\mathbb{C}^d), n) = O(d/\epsilon^2). \]


Proof. The upper bound of the empirical $L_2$ entropy number by the Rademacher complexity follows directly from
Sudakov’s minoration theorem. Denote the (vector-valued) stochastic process by

\[ X_f := \frac{1}{\sqrt{n}} \big( \gamma_1 f(x_1), \dots, \gamma_n f(x_n) \big), \]

where $x_1, \dots, x_n$ are independently drawn from $\mathcal{X}$ according to some distribution $\mu$. Then the distance measure
can be calculated as

\[ d_X(f, g) = \|X_f - X_g\|_2 = \Big( \frac{1}{n} \sum_{i=1}^n |f(x_i) - g(x_i)|^2 \Big)^{1/2} = \|f - g\|_{L_2(\mu_n)} . \]

Invoke Theorems 4.3 and 4.2 to obtain

\[ \log \mathcal{N}(\epsilon, \mathcal{F}, L_2(\mu_n)) = \log \mathcal{N}(\epsilon, \mathcal{F}, d_X) \leq \frac{C^2 \big( \mathbb{E} \sup_{f \in \mathcal{F}} \|X_f\|_1 \big)^2}{\epsilon^2} = \frac{C^2 \mathcal{R}_n(\mathcal{F})^2}{\epsilon^2} . \]

Note that the right-hand side in the last line does not depend on the distribution $\mu$. Hence the entropy number
$\log \mathcal{N}_2(\epsilon, \mathcal{F}, n) = \sup_{\mu_n} \log \mathcal{N}(\epsilon, \mathcal{F}, L_2(\mu_n)) = O(d/\epsilon^2)$ follows. □


11 A stochastic process is called sub-Gaussian if there exists $\sigma > 0$ such that $\mathbb{E} \exp(\theta X_t) \leq \exp(\sigma^2 \theta^2 / 2)$ for all $\theta \in \mathbb{R}$ and $t \in T$. Note
that both Gaussian processes and Rademacher processes are sub-Gaussian processes.
The pseudo dimension of the effect space, $\mathrm{Pdim}(\mathcal{E}(\mathbb{C}^d)) = d^2$, means that we need $d^2$ parameters to exactly determine
a POVM element. Note that it coincides with the number of measurements in quantum measurement tomography
(since $\mathcal{E}(\mathbb{C}^d)$ lies in a $d^2$-dimensional real vector space). However, if we relax the criterion by tolerating an $\epsilon$ accuracy,
then the effect space can be covered by $\mathcal{N}_2(\epsilon, \mathcal{E}(\mathbb{C}^d)) = \exp(d/\epsilon^2)$ balls each with radius $\epsilon$. In other words, we need
$\log \mathcal{N}_2(\epsilon, \mathcal{E}(\mathbb{C}^d)) = d/\epsilon^2$ samples to identify which ball the target POVM element lies in. That is the meaning of the
entropy number. By applying the quantum learning model to quantum measurement tomography, we can specify
a “PAC” candidate POVM element with accuracy $\epsilon$ and confidence $\delta$ with only $d/\epsilon^2$ samples, which is a quadratic
speed-up (in $d$) over the original scheme.
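
To make the comparison concrete, the short Python sketch below tabulates the two scalings, $d^2$ for exact tomography and $\min\{d/\epsilon^2, d^2\}$ for the learning bound. It sets the hidden constant in $O(\cdot)$ to 1, which is an assumption made purely for illustration, so the numbers indicate orders of magnitude only; the function names are hypothetical.

```python
def tomography_parameters(d):
    """Number of real parameters needed to determine a POVM element exactly."""
    return d * d

def learning_scale(d, eps):
    """min{d / eps^2, d^2}, with the O(.) constant set to 1 (illustrative assumption)."""
    return min(d / eps ** 2, d * d)

# Each row is (d, tomography scaling, learning scaling at eps = 0.5).
rows = [(d, tomography_parameters(d), learning_scale(d, eps=0.5)) for d in (4, 16, 64, 256)]
```

The learning scale falls below $d^2$ only once $d > 1/\epsilon^2$, which is why the two columns coincide for $d = 4$ at $\epsilon = 0.5$.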


4.4. The Relationship to Quantum State Discrimination. Quantum State Discrimination studies how to 
optimally distinguish a set of quantum states according to a figure of merit [88, 89]. 

There are nevertheless some limitations in quantum state discrimination because the states cannot always be
perfectly discriminated. Moreover, it may not be necessary to find the exact state in some scenarios. Therefore,
Zhang and Ying [90] considered quantum set discrimination, where the goal is to identify which set the given state
belongs to. Now we relate the concept of the fat-shattering dimension to quantum set discrimination.

Definition 4.1 ($\epsilon$-separable Set). A set $S = \{x_1, \dots, x_n\} \subseteq \mathbb{M}_d$ is $\epsilon$-(linearly) separable with respect to the set
$W \subseteq \mathbb{M}_d$ if and only if for any subset $B \subseteq S$ there exists an $\epsilon$-strip which separates $B$ from its complement $S \setminus B$.
In other words, there exist $w \in W$ and $a \in \mathbb{R}$ such that $\langle w, x \rangle \geq a + \epsilon/2$ when $x \in B$ and $\langle w, x \rangle \leq a - \epsilon/2$ when
$x \in S \setminus B$.


It is not difficult to see that a $2\epsilon$-separable set corresponds to the task of quantum set discrimination with
ensemble $S = \{x_1, \dots, x_n\}$, where the error probability of classifying a given state into a set is no greater
than $(1 - \epsilon)/2$. One interesting question to ask is what the maximum cardinality of a $2\epsilon$-separable set is. The
following proposition shows that the fat-shattering dimension equals this quantity.


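Proposition 4.3. Denote the function class $\mathcal{F} = \{ \rho \mapsto \langle E, \rho \rangle : E \in \mathcal{E}(\mathbb{C}^d) \}$. Assume there exists a set $S =
\{x_1, \dots, x_n\} \subseteq \mathcal{Q}(\mathbb{C}^d)$ that is $2\epsilon$-separable with respect to $\mathcal{E}(\mathbb{C}^d)$. Then the maximum cardinality of the set $S$ is
$\mathrm{fat}_{\mathcal{F}}(\epsilon)$.


Proof. Recall from Definition 2.5 that the set $S = \{x_1, \dots, x_n\}$ is $2\epsilon$-separable with respect to $\mathcal{E}(\mathbb{C}^d)$ if and only if
$\underline{\mathrm{fat}}_{\mathcal{F}}(\epsilon) \geq n$. Then the proposition is equivalent to showing that $\mathrm{fat}_{\mathcal{F}}(\epsilon) = \underline{\mathrm{fat}}_{\mathcal{F}}(\epsilon)$.

Because $\underline{\mathrm{fat}}_{\mathcal{F}}(\epsilon) \leq \mathrm{fat}_{\mathcal{F}}(\epsilon)$ by definition, it suffices to show $\underline{\mathrm{fat}}_{\mathcal{F}}(\epsilon) \geq \mathrm{fat}_{\mathcal{F}}(\epsilon)$. Given $\epsilon > 0$, choose a set
$S = \{x_1, \dots, x_n\}$ with the largest integer $n$ such that $S$ is $\epsilon$-shattered by $\mathcal{F}$ (with $\{s_i\}_{i=1}^n$ witnessing the shattering).
Without loss of generality, we assume some $s_i \neq 1/2$. We then choose an arbitrary subset $B \subseteq \{1, \dots, n\}$ that
contains $i$. By the definition of the fat-shattering dimension, there exists $s_i := s(x_i)$ such that there is some function
$E_B \in \mathcal{F}$ for each set $B \subseteq S$ so that $\langle E_B, x_i \rangle \geq s_i + \epsilon$ if $i \in B$. Also, we have $\langle E_{\bar{B}}, x_i \rangle \leq s_i - \epsilon$, where $\bar{B} = S \setminus B$.
Now denote $\tilde{E}_{\bar{B}} := I - E_{\bar{B}}$ such that

\[ \langle \tilde{E}_{\bar{B}}, x_i \rangle = 1 - \langle E_{\bar{B}}, x_i \rangle \geq 1 - s_i + \epsilon. \]

Since $\mathcal{F}$ is convex, set $E'_B := \frac{1}{2} (E_B + \tilde{E}_{\bar{B}}) \in \mathcal{F}$, which satisfies

\[ \langle E'_B, x_i \rangle \geq 1/2 + \epsilon. \]

Similarly, letting $E'_{\bar{B}} := I - E'_B$, we have

\[ \langle E'_{\bar{B}}, x_i \rangle \leq 1/2 - \epsilon. \]

The same argument holds for other $s_i \neq 1/2$. It follows that the level fat-shattering dimension (witnessed by $1/2$)
also achieves the cardinality $n$ of the $\epsilon$-shattered set, which completes the proof. □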


5. Learning Quantum States 

In this section, we consider the problem of learning an unknown quantum state $\rho' \in \mathcal{Q}(\mathbb{C}^d)$ through the training
data set $Z_n = \{(E_i, \mathrm{Tr}(\rho' E_i))\}_{i=1}^n$, where $\{E_i\}_{i=1}^n \subset \mathcal{X} = \mathcal{E}(\mathbb{C}^d)$ are independently sampled according to an unknown
distribution $\mu'$. By Proposition 3.2, the hypothesis set consists of the linear functionals $f_\rho : E \mapsto \langle E, \rho \rangle$ on $\mathcal{E}(\mathbb{C}^d)$:

\[ \mathcal{F}' = \{ f_\rho : \rho \in \mathcal{Q}(\mathbb{C}^d) \}. \]

Similarly, we embed the input space into the unit ball of Schatten $\infty$-class, i.e. $\mathcal{X} = S_\infty^d$. Then the hypothesis set
is the collection of linear functionals on the input space, i.e. $S_1^d$. In the following, we aim to calculate the complexity
measures of $S_1^d$, which characterise the sample complexity of learning quantum states. It is interesting to see that
the proofs derived in this section (i.e. the complexity measures of learning quantum states) parallel those in
the previous section (i.e. the complexity measures of learning quantum measurements) due to the duality relation
in Theorem 3.1. Finally, we discuss the relationship of the fat-shattering dimension with quantum random access
codes.


5.1. The Fat-Shattering Dimension for Learning Quantum States. Under the framework presented in
Section 3, we characterise the input space $\mathcal{X} \subseteq S_\infty^d$ and the hypothesis set $\mathcal{F}'$ consisting of the linear functionals
with elements in $S_1^d$. That is,

\[ \mathcal{F}' = \{ f_\rho : \rho \in S_1^d \}. \]

We can now state the main result on the fat-shattering dimension of the state space.


Theorem 5.1 (Fat-Shattering Dimension for Learning Quantum States). For all $0 < \epsilon < 1/2$ and integer $d \geq 2$,
we have

\[ \mathrm{Pdim}(\mathcal{Q}(\mathbb{C}^d)) \leq d^2 - 1, \]

and

\[ \mathrm{fat}_{\mathcal{Q}(\mathbb{C}^d)}(\epsilon, \mathcal{E}(\mathbb{C}^d)) = \min\{ O(\log d/\epsilon^2), d^2 - 1 \}. \]

Proof. Following the same approach as in the proof of Theorem 4.1, we first estimate the pseudo dimension and then
the fat-shattering dimension.

(i) Pseudo Dimension: The state space lies in the set $\{x \in \mathbb{M}_d : \|x\|_1 = 1\}$, which is the sphere of $S_1^d$, i.e. $\mathcal{Q}(\mathbb{C}^d) \subset
\partial S_1^d$. Since $\partial S_1^d$ can be embedded into a real vector space of dimension $d^2 - 1$, we have $\mathrm{Pdim}(\mathcal{Q}(\mathbb{C}^d)) \leq d^2 - 1$.

(ii) Fat-Shattering Dimension: For every $\{x_i\}_{i=1}^n \subset S_\infty^d$, we have to calculate the Rademacher series $\mathbb{E} \|\sum_{i=1}^n \gamma_i x_i\|_\infty$.

However, in the scenario of learning quantum states the input space lies in the Schatten $\infty$-class. We have to estimate
the spectral norm of the Rademacher series. Benefiting from the recent development of matrix concentration
inequalities, Tropp [91] proved the following results:

Proposition 5.1 (Upper Bound for Rademacher Series [91]). Consider a finite sequence $\{x_i\}$ of deterministic Hermitian
matrices with dimension $d$, and let $\{\gamma_i\}$ be independent Rademacher variables. Form the matrix Rademacher
series

\[ Y = \sum_i \gamma_i x_i . \]

Compute the variance parameter

\[ \sigma^2 = \sigma^2(Y) = \| \mathbb{E}(Y^2) \|_\infty . \]

Then

\[ \mathbb{E} \|Y\|_\infty \leq \sqrt{2 \sigma^2 \log d}. \]

Note that the result also holds for the case of $\{\gamma_i\}$ being standard complex Gaussian variables.


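Invoking Tropp’s development of matrix concentration inequalities (see Proposition 5.1), we have

\[ \mathbb{E} \Big\| \sum_{i=1}^n \gamma_i x_i \Big\|_\infty \leq \sqrt{2 \sigma^2 \log d}, \tag{5.1} \]

where $\sigma^2 := \big\| \mathbb{E} \big( \sum_{i=1}^n \gamma_i x_i \big)^2 \big\|_\infty$. Straightforward computation shows that

\[ \sigma^2 = \Big\| \mathbb{E} \Big( \sum_{i=1}^n \gamma_i x_i \Big)^2 \Big\|_\infty = \Big\| \mathbb{E} \sum_{i,j} \gamma_i \gamma_j x_i x_j \Big\|_\infty = \Big\| \sum_{i=1}^n x_i^2 \Big\|_\infty \leq n. \]

We get

\[ \mathbb{E} \Big\| \sum_{i=1}^n \gamma_i x_i \Big\|_\infty \leq \sqrt{2 n \log d}. \]

Then there is a realization of $\{\gamma_i\}_{i=1}^n$ such that $\|\sum_{i=1}^n \gamma_i x_i\|_\infty \leq \sqrt{2 n \log d}$, $\forall x_i \in S_\infty^d$.

From Lemma 3.1, by selecting $a_i = \gamma_i$, $\epsilon n \leq \|\sum_{i=1}^n \gamma_i x_i\|_\infty$. Combining the inequalities, we have $n \leq O(\log d/\epsilon^2)$,
completing the proof. □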
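
The spectral-norm bound can be explored numerically. The Python sketch below estimates $\mathbb{E}\|\sum_i \gamma_i x_i\|_\infty$ by Monte Carlo for random Hermitian matrices of unit spectral norm and compares it with a bound of the matrix Khintchine type. Note the assumption: the sketch uses the form $\sqrt{2\sigma^2 \log(2d)}$, which differs from the $\log d$ form quoted above by a constant inside the logarithm. The helper names and the sampling scheme are hypothetical.

```python
import numpy as np

def spectral_norm(a):
    """Schatten infinity-norm of a Hermitian matrix: its largest absolute eigenvalue."""
    return float(np.abs(np.linalg.eigvalsh(a)).max())

def random_contraction(d, rng):
    """Random Hermitian matrix rescaled to spectral norm 1 (an element of S_infty^d)."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    h = (g + g.conj().T) / 2
    return h / spectral_norm(h)

def mean_rademacher_spectral_norm(mats, trials=500, seed=1):
    """Monte Carlo estimate of E || sum_i gamma_i x_i ||_infty over random signs."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        gammas = rng.choice([-1.0, 1.0], size=len(mats))
        total += spectral_norm(np.tensordot(gammas, mats, axes=1))
    return total / trials

d, n = 8, 20
rng = np.random.default_rng(0)
mats = np.array([random_contraction(d, rng) for _ in range(n)])

sigma2 = spectral_norm(sum(x @ x for x in mats))      # variance parameter || sum_i x_i^2 ||
estimate = mean_rademacher_spectral_norm(mats)
bound = float(np.sqrt(2.0 * sigma2 * np.log(2 * d)))  # log(2d) variant (assumption)
```

Since every $x_i$ has spectral norm at most 1, the variance parameter never exceeds $n$, which is the step that turns the bound into one of order $\sqrt{n \log d}$.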
5.2. The Rademacher Complexity. By repeating the procedure introduced in Section 4.2, we can compute the
Rademacher complexity of the state space.


Theorem 5.2 (Rademacher Complexity for Learning Quantum States). Assume the input space is the effect space
$\mathcal{X} = \mathcal{E}(\mathbb{C}^d)$. The hypothesis set $\mathcal{F}$ defined on $\mathcal{X}$ is the state space $\mathcal{Q}(\mathbb{C}^d)$. Then the Rademacher complexity of
the hypothesis set is

\[ \mathcal{R}_n(\mathcal{Q}(\mathbb{C}^d)) = O(\sqrt{\log d}). \]

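Proof. Recall the definition of the Rademacher complexity. We have

\[ \sqrt{n} \, \mathcal{R}_n(S_1^d) = \mathbb{E} \sup_{\rho \in S_1^d} \Big| \sum_{i=1}^n \gamma_i f_\rho(E_i) \Big| \]
\[ = \mathbb{E} \sup_{\rho \in S_1^d} \Big| \sum_{i=1}^n \gamma_i \langle E_i, \rho \rangle \Big| \]
\[ = \mathbb{E} \sup_{\rho \in S_1^d} \Big| \Big\langle \sum_{i=1}^n \gamma_i E_i, \rho \Big\rangle \Big| \]
\[ \leq \mathbb{E} \Big\| \sum_{i=1}^n \gamma_i E_i \Big\|_\infty \]
\[ \leq \sqrt{n \log d}. \]

The fourth line is due to the duality formula, Theorem 3.2. The last relation follows from Eq. (5.1), which completes
the proof. □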


5.3. The Entropy Number. 

Corollary 5.1 (Entropy Number for Learning Quantum States). Assume the input space is $\mathcal{X} = \mathcal{E}(\mathbb{C}^d)$. The
function class $\mathcal{F}$ defined on $\mathcal{X}$ is the state space $\mathcal{Q}(\mathbb{C}^d)$. Then for each $\epsilon > 0$, the covering number of the function
class is

\[ \log \mathcal{N}_2(\epsilon, \mathcal{Q}(\mathbb{C}^d), n) = O(\log d/\epsilon^2). \]

Compared with the entropy number of the effect space, the result for the state space is proportional to the
logarithm of the dimension. The intuition behind this is that the unit ball of Schatten $\infty$-class is much larger than the
unit ball of Schatten 1-class. Thus, it requires more $\epsilon$-radius balls to cover the whole effect space than the state
space. From the volumetric perspective, this fact is more evident. Denote by $|\cdot|$ the Lebesgue measure on the
Banach space of the Schatten class. It can be calculated that

|£(c d )| 1/d2 „ f\sj \\ 1/d2 ~ 

|Q(C d )|i/( d2 -D “ V \S (| ) ~ ’ 

which shows that the volume of the effect space is essentially exponential (in the dimension d) to the state space. 
Recall that the complexity measures are the quantity to estimate the effective size of the hypothesis set. Accordingly, 
it is reasonable that the complexity measures of the effect space are exponentially compared with that of the state 
space. In other words, the results of Theorem 4.1 demonstrate the richness of the effect space. 


5.4. The Relationship to Quantum Random Access Codes. The learnability of quantum states was first
addressed by Aaronson [39]. Ingeniously, he applied the results of Quantum Random Access (QRA) Coding [92] to provide
an information-theoretic upper bound on the fat-shattering dimension for learning $m$-qubit quantum states. We
first give the definition of QRA codes and then discuss Aaronson's result.

Definition 5.1 (Quantum Random Access Coding). An $(n, m, p)$-QRA coding is a function that maps $n$-bit
strings $x \in \{0,1\}^n$ to $m$-qubit states $\rho_x$ satisfying the following: For every $i \in \{1, \ldots, n\}$ there exists a POVM
$E^i = \{E_0^i, E_1^i\}$ such that $\mathrm{Tr}(E_{x_i}^i \rho_x) \ge p$ for all $x \in \{0,1\}^n$, where $x_i$ is the $i$-th bit of $x$.

If there exists an $(n, m, p)$-QRA coding, then the set $\{E_1^i\}_{i=1}^n$ is $(p - 1/2)$-shattered by $\{\rho_x\}$,
and the constant value $1/2$ witnesses the shattering. Writing $\epsilon = p - 1/2$ and denoting by $H$ the binary entropy function, we have

(5.2)    $m \ge (1 - H(\epsilon + 1/2))\, n \ge c\, \epsilon^2 n.$

Therefore, the inequality gives an upper bound on the level fat-shattering dimension, i.e. $\underline{\mathrm{fat}}_{\mathcal{Q}(\mathbb{C}^d)}(p - 1/2) = O(m/\epsilon^2)$.
Conversely, a fat-shattering dimension at scale $(p - 1/2)$ does not guarantee the existence of an $(n, m, p)$-QRA
coding (since there may be some witness $s_i < 1/2$), but it does provide an upper bound on the success probability $p$ if such a coding
exists.
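As a small worked illustration of Eq. (5.2), the sketch below evaluates the entropy bound $m \ge (1 - H(p))\,n$ and the quadratic lower bound $c\,\epsilon^2 n$. The constant $c$ is not specified in the text; the value $c = 2/\ln 2$ used here is an assumption that follows from the Taylor expansion of the binary entropy around $1/2$. The function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def min_qubits(n, p):
    """Middle expression of Eq. (5.2): (1 - H(p)) n with p = eps + 1/2."""
    return (1.0 - binary_entropy(p)) * n

def quadratic_lower_bound(n, eps, c=2.0 / math.log(2)):
    """Right-hand side of Eq. (5.2) with an assumed constant c = 2 / ln 2."""
    return c * eps ** 2 * n

n, p = 2, 0.85
eps = p - 0.5
print(min_qubits(n, p))              # below 1, so a single qubit is not ruled out
print(quadratic_lower_bound(n, eps)) # smaller still, as Eq. (5.2) requires
```

For the $(2, 1, 0.85)$ parameters mentioned later in the text, the entropy bound stays below one qubit, so Eq. (5.2) does not exclude such a code.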

However, in the case that functions in $\mathcal{F}$ have a bounded range of $[0,1]$, Gurvits [93] utilised the pigeonhole
principle to relate the level fat-shattering dimension to the fat-shattering dimension.

Theorem 5.3 (Gurvits [93]). For any hypothesis set $\mathcal{F}$ consisting of $[0,1]$-valued functions, we have

(5.3)    $\left( 2(1 - 2\epsilon)/\epsilon \right)^{-1} \mathrm{fat}_{\mathcal{F}}(2\epsilon) \le \underline{\mathrm{fat}}_{\mathcal{F}}(\epsilon/2) \le \mathrm{fat}_{\mathcal{F}}(\epsilon/2).$

By definition, $\underline{\mathrm{fat}}_{\mathcal{F}}(\epsilon) \le \mathrm{fat}_{\mathcal{F}}(\epsilon)$. However, from the above theorem, the dependencies on the dimension $d$ are of
the same order for both the level fat-shattering dimension and the fat-shattering dimension. Consequently, from
Eq. (5.2) we have $\underline{\mathrm{fat}}_{\mathcal{F}}(\epsilon) = O(m/\epsilon^2)$, which leads to $\mathrm{fat}_{\mathcal{F}}(\epsilon) = O(m/\epsilon^2)$ according to the inequalities in Eq. (5.3).
Thus we recover Aaronson's result.

Theorem 5.4 (Aaronson [39]). The fat-shattering dimension for learning the class $\mathcal{F}$ of all $m$-qubit states is $\mathrm{fat}_{\mathcal{F}}(\epsilon) =
O(m/\epsilon^2)$.

We remark that it is unknown whether $\underline{\mathrm{fat}}_{\mathcal{F}}(\epsilon) = \mathrm{fat}_{\mathcal{F}}(\epsilon)$ for $\mathcal{F} = \mathcal{Q}(\mathbb{C}^d)$.

Proposition 5.2. There is no $(2^{2m}, m, p)$-QRA coding for $1/2 < p \le 1$ and positive integer $m$.

Hayashi et al. [59] showed that there is no $(2^{2m}, m, p)$-QRA coding for $1/2 < p \le 1$. This result can be directly
derived from Theorem 5.1, which shows that $\mathrm{Pdim}(\mathcal{Q}(\mathbb{C}^d)) \le d^2 - 1$. The dimension $d$ of an $m$-qubit system is $2^m$. The
upper bound on the pseudo dimension then shows that no set of $d^2 = 2^{2m}$ two-outcome POVMs can be shattered
(by the function class of the state space), which coincides with the result of Hayashi et al.


6. The Algorithms for Quantum Machine Learning 

In the previous sections, we presented the information-theoretic analysis of the quantum learning problems.
In this section, we provide a constructive way to implement quantum ML tasks by representing the learning framework
in Bloch space.

We gather all the materials and derivations concerning the Bloch-sphere representation in Appendix C. Recall
from Eq. (C.6) that the function class of rank-$k$ effects and their mixtures can be represented as the following affine
functionals:

\[ \mathcal{F}_k = \mathrm{conv}\left( \left\{ r \mapsto \frac{k}{d}\big(1 + (d-1)\, r \cdot n^{(k)}\big) \right\} \right), \]

where $r$ is the Bloch vector of the quantum state and $n^{(k)}$ (see Eq. (C.3)) parameterises the function in the hypothesis
set $\mathcal{F}_k$. Moreover, it can in turn be written as

\[ \mathcal{F}_k = \sigma(v \cdot r + v_0), \]

where $\sigma : \mathbb{R} \to \mathbb{R}$ is called the activation function. The Bloch vector $r \in \mathbb{R}^{d^2-1}$ is the input vector; $[v_0, v] \in \mathbb{R}^{d^2}$ are
the input weights. Each map $r \mapsto \sigma(v \cdot r + v_0)$ can be thought of as a function computed by a linear perceptron.
Using the terminology from the theory of neural networks [61], each $\mathcal{F}_k$ is called a single-layer neural network (see
Appendix D for more details).

Considering the function class of the whole effect space, we exploit the convexity of the effect space and obtain
the following result:

\[ \sum_{k=0}^{d} w_k \frac{k}{d}\big(1 + (d-1)\, r \cdot n^{(k)}\big) =: \frac{1}{d}\big(n_0 + (d-1)\, r \cdot n\big), \]

where $\sum_{k=0}^{d} w_k = 1$. This is called a two-layer neural network (also called a single-hidden layer net). Based
on this formulation, the tasks of learning quantum measurements can be implemented by existing neural network
algorithms or other classical ML algorithms. We note that the neural network formulation for learning quantum
states follows in the same way by virtue of the duality.

Additionally, the fat-shattering dimension of $\mathcal{F}_k$ can easily be bounded using classical results on neural
networks. We have the following corollary.


Corollary 6.1. Suppose the hypothesis set $\mathcal{F}_k$ consists of rank-$k$ projection operators and their mixture. We have

\[ \mathrm{fat}_{\mathcal{F}_k}(\epsilon) \le \frac{k(d-1)(d-k)}{(d\epsilon)^2}, \quad k = 0, 1, \ldots, d. \]

Proof. Since $\mathcal{F}_k$ is a linear function class on $\mathbb{R}^{d^2-1}$, we invoke the classical result from Anthony and Bartlett [61]:

\[ \mathrm{fat}_{\mathcal{F}}(\epsilon) \le \frac{a^2 b^2}{\epsilon^2}, \]

where $\mathcal{F} = \{x \mapsto \langle w, x \rangle : \|x\|_2 \le b,\ \|w\|_2 \le a,\ x, w \in \mathbb{R}^{d^2-1}\}$.

Therefore, it remains to calculate the coefficients using Eq. (C.4). Since $\|r\|_2 \le 1$, and

\[ \left\| \frac{k(d-1)}{d}\, n^{(k)} \right\|_2 = \frac{k(d-1)}{d} \sqrt{\frac{d-k}{k(d-1)}} = \frac{1}{d} \sqrt{k(d-1)(d-k)}, \]

the result follows. □


We can see from the corollary that the fat-shattering dimension increases as the rank $k$ approaches half
of the Hilbert space dimension $d$, which means that the classes $\{\mathcal{F}_k\}$ form a hierarchical structure. Operationally,
the hypothesis set $\mathcal{F}_1$ can be chosen at first. It can then be enlarged into $\mathrm{conv}(\mathcal{F}_0 \cup \mathcal{F}_1 \cup \mathcal{F}_2)$ and so forth until the
whole effect space is considered. This is called structural risk minimisation (SRM) [2], and is usually adopted
in classical ML to avoid overfitting. Here we give two examples to illustrate the concepts in Corollary 6.1.
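To make the hierarchy concrete, the following sketch tabulates the upper bound of Corollary 6.1, read here as $k(d-1)(d-k)/(d\epsilon)^2$ (the form obtained in the proof from $a^2 b^2/\epsilon^2$). The function name and the chosen values of $d$ and $\epsilon$ are illustrative assumptions.

```python
def fat_bound(d, k, eps):
    """Upper bound k (d - 1) (d - k) / (d * eps)^2 on the fat-shattering dimension."""
    return k * (d - 1) * (d - k) / (d * eps) ** 2

d, eps = 8, 0.25
bounds = [fat_bound(d, k, eps) for k in range(d + 1)]
for k, b in enumerate(bounds):
    print(f"k = {k}: bound = {b:.2f}")

# The rank with the largest bound; for even d this is k = d / 2.
best_k = max(range(d + 1), key=lambda k: bounds[k])
print("largest bound at k =", best_k)
```

The bound vanishes at $k = 0$ and $k = d$, where the class contains a single constant function, and peaks at half the dimension, which is the ordering an SRM-style procedure would move through.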


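Example 6.1 (Learning rank-1 Projection Valued Measures (PVMs): Qubit system attains the upper bound).
The fat-shattering dimension of rank-1 projection operators and their mixture in a qubit system ($d = 2$) can be bounded
by

\[ \mathrm{fat}_{\mathcal{F}_1}(\epsilon) \le \frac{(d-1)^2}{(d\epsilon)^2} = \frac{1}{4\epsilon^2}. \]

Consider two quantum states $\rho_{r_1} = |1\rangle\langle 1|$, $\rho_{r_2} = |-\rangle\langle -|$ with corresponding Bloch vectors $r_1 = (0, 0, -1)$, $r_2 =
(-1, 0, 0)$. To shatter these two quantum states, we construct four quantum effects with the Bloch vectors:

\[ n_{00} = \tfrac{1}{\sqrt{2}}(1, 0, 1), \quad n_{10} = \tfrac{1}{\sqrt{2}}(1, 0, -1), \]

\[ n_{11} = \tfrac{1}{\sqrt{2}}(-1, 0, -1), \quad n_{01} = \tfrac{1}{\sqrt{2}}(-1, 0, 1). \]

Since the angles between the states and effects are either $\pi/4$ or $3\pi/4$, we have

\begin{align*}
(\mathrm{Tr}(E_{n_{00}} \rho_{r_1}), \mathrm{Tr}(E_{n_{00}} \rho_{r_2})) &= \big(\tfrac12(1 - \tfrac{1}{\sqrt2}),\ \tfrac12(1 - \tfrac{1}{\sqrt2})\big), \\
(\mathrm{Tr}(E_{n_{10}} \rho_{r_1}), \mathrm{Tr}(E_{n_{10}} \rho_{r_2})) &= \big(\tfrac12(1 + \tfrac{1}{\sqrt2}),\ \tfrac12(1 - \tfrac{1}{\sqrt2})\big), \\
(\mathrm{Tr}(E_{n_{11}} \rho_{r_1}), \mathrm{Tr}(E_{n_{11}} \rho_{r_2})) &= \big(\tfrac12(1 + \tfrac{1}{\sqrt2}),\ \tfrac12(1 + \tfrac{1}{\sqrt2})\big), \\
(\mathrm{Tr}(E_{n_{01}} \rho_{r_1}), \mathrm{Tr}(E_{n_{01}} \rho_{r_2})) &= \big(\tfrac12(1 - \tfrac{1}{\sqrt2}),\ \tfrac12(1 + \tfrac{1}{\sqrt2})\big).
\end{align*}

Clearly these four quantum effects $\tfrac{1}{2\sqrt2}$-shatter $(r_1, r_2)$ and achieve the fat-shattering dimension $\mathrm{fat}_{\mathcal{F}_1}(\tfrac{1}{2\sqrt2}) = 2$.

The case of three quantum states follows similarly. Consider $r_1 = (1, 0, 0)$, $r_2 = (0, 1, 0)$, $r_3 = (0, 0, 1)$, and
$n_{ijk} = \tfrac{1}{\sqrt3}((-1)^i, (-1)^j, (-1)^k)$ for $i, j, k \in \{0, 1\}$. With some calculations, the eight quantum effects $\tfrac{1}{2\sqrt3}$-shatter $(r_1, r_2, r_3)$ and
achieve the fat-shattering dimension $\mathrm{fat}_{\mathcal{F}_1}(\tfrac{1}{2\sqrt3}) = 3$.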

It is worth emphasising that the dual problem of learning quantum states is equivalent to learning quantum
measurements when the hypothesis set consists of rank-1 projections and their mixture. The reason is that the
two mathematical objects are exactly the same, i.e. $\mathrm{conv}(\mathcal{F}_1) = \mathcal{Q}(\mathbb{C}^d)$. In this scenario, the dual problem has
the same results, which are optimal in the sense of Quantum Random Access codes (i.e. $(2, 1, 0.85)$-QRA codes [94]).
Furthermore, we note that the measurements in the $(2, 1, 0.85)$-QRA codes and the input states $(\rho_{r_1}, \rho_{r_1}^{\perp})$, $(\rho_{r_2}, \rho_{r_2}^{\perp})$
in this example form mutually unbiased bases (MUB), which attain the upper bound of the qubit system.
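The two-state construction of Example 6.1 can be verified directly from the Bloch-vector formula $\mathrm{Tr}(E_n \rho_r) = \tfrac12(1 + r \cdot n)$ for a qubit. In the sketch below, the signs of the effect vectors are chosen so that the values match the four sign patterns of the example (an assumption where the extracted text is ambiguous), the witness is taken to be $1/2$, and all function names are hypothetical.

```python
import itertools
import math

def effect_value(r, n, d=2):
    """Tr(E_n rho_r) = (1/d)(1 + (d - 1) r.n) for Bloch vectors r and n."""
    dot = sum(a * b for a, b in zip(r, n))
    return (1 + (d - 1) * dot) / d

s = 1 / math.sqrt(2)
states = [(0, 0, -1), (-1, 0, 0)]   # r_1 for |1><1| and r_2 for |-><-|
# Effect Bloch vectors indexed by the pattern (b_1, b_2) they should realise:
# b_i = 1 means "value above the witness on state i".
effects = {
    (0, 0): (s, 0, s),
    (1, 0): (s, 0, -s),
    (1, 1): (-s, 0, -s),
    (0, 1): (-s, 0, s),
}

def shatters(states, effects, margin, witness=0.5, tol=1e-12):
    """Check that every 0/1 pattern is realised with the given margin."""
    for pattern in itertools.product((0, 1), repeat=len(states)):
        for bit, r in zip(pattern, states):
            v = effect_value(r, effects[pattern])
            if bit == 1 and v < witness + margin - tol:
                return False
            if bit == 0 and v > witness - margin + tol:
                return False
    return True

margin = 1 / (2 * math.sqrt(2))
print(shatters(states, effects, margin))
```

The same helper reports failure for any larger margin, in line with the bound $1/(4\epsilon^2) = 2$ at $\epsilon = 1/(2\sqrt2)$.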


Example 6.2 (Rank equals half the Hilbert space dimension). Consider a quaternary Hilbert space, i.e. $\mathbb{C}^4$. First,
we show that there exist no two quantum states that can be $1/2$-shattered by the convex hull of rank-1 projection
operators. Consider two arbitrary different quantum states $S = \{\rho_i\}_{i=1}^2$. If the function class $\mathcal{F}_1$ can $1/2$-shatter
the set $S$, then there must be an effect $E \in \mathcal{F}_1$ such that $\mathrm{Tr}(E\rho_1) = \mathrm{Tr}(E\rho_2) = 1$. Clearly, this can be achieved only
when $E$ is a rank-1 projection and the two quantum states are both equal to $E$, which contradicts the assumption.

Second, we show that there exist two quantum states that can be $1/2$-shattered by the rank-2 projection operators.
Assume $\rho_i = |i-1\rangle\langle i-1|$, $i = 1, 2$. We construct four quantum effects, each a rank-2 projection expressed
in the computational basis. The two quantum states can then be $1/2$-shattered by these four quantum effects. This
example demonstrates that the set of rank-2 projections is richer than the set of rank-1 projections in terms of the
complexity measures.
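One possible choice of four rank-2 projections for the second part of Example 6.2 is sketched below. These diagonal projections are a hypothetical construction chosen for illustration and are not claimed to be the matrices used by the authors; the witness is again taken to be $1/2$.

```python
# States rho_1 = |0><0| and rho_2 = |1><1| on C^4, identified by basis index.
state_indices = (0, 1)

# Hypothetical rank-2 projections, each given by its diagonal in the
# computational basis and indexed by the 0/1 pattern it realises.
effects = {
    (1, 1): (1, 1, 0, 0),
    (1, 0): (1, 0, 1, 0),
    (0, 1): (0, 1, 1, 0),
    (0, 0): (0, 0, 1, 1),
}

def value(diagonal, index):
    """Tr(E |index><index|) for a diagonal effect E."""
    return diagonal[index]

def half_shattered(effects, state_indices):
    """Check 1/2-shattering with witness 1/2: values must be 1 or 0 as required."""
    for pattern, diagonal in effects.items():
        for bit, idx in zip(pattern, state_indices):
            if value(diagonal, idx) != bit:
                return False
    return True

ranks = [sum(diagonal) for diagonal in effects.values()]
print(ranks)                                   # every projection has rank 2
print(half_shattered(effects, state_indices))
```

A diagonal rank-1 projection has a single non-zero entry and therefore cannot give the value 1 on both basis states, which mirrors the first part of the example for this restricted family.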

Remark. The readers may contemplate the pros and cons of the Bloch-sphere representation when analysing the
fat-shattering dimension. Indeed, the Bloch-sphere representation provides a geometric picture so that we have more
concrete ideas of the linear relation between quantum measurements and states. Furthermore, in Example 6.1 we
see how the extreme points (projection operators) and MUB play a role in the fat-shattering dimension. However,
it is difficult to fully characterise the region of the Bloch space. To the best of our knowledge, the most convenient
metric used in the Bloch-sphere representation is the Euclidean norm, which corresponds to the Hilbert-Schmidt norm
(Schatten 2-norm) in the state space, i.e.

\[ \|\rho_{r_1} - \rho_{r_2}\|_{HS} = \sqrt{\frac{d-1}{d}}\, \|r_1 - r_2\|_2. \]

Recalling that $\mathrm{conv}(-\mathcal{Q}(\mathbb{C}^d) \cup \mathcal{Q}(\mathbb{C}^d)) = S_1^d \subset S_2^d \subset S_\infty^d = \mathrm{conv}(-\mathcal{E}(\mathbb{C}^d) \cup \mathcal{E}(\mathbb{C}^d))$, the Hilbert-Schmidt norm is
not efficient in characterising the state space (that is why some regions in the Bloch sphere do not represent
valid states). On the other hand, the unit ball of the Schatten 2-class is not sufficient to contain $S_\infty^d$, so we have to scale
up the Hilbert-Schmidt norm by a factor $\sqrt{d}$ (since $\|\cdot\|_2 \le \sqrt{d}\, \|\cdot\|_\infty$). Then we may overestimate the effective size
of the effect space. As a result, directly analysing the linear functionals between $S_1^d$ and $S_\infty^d$ is the most efficient way
of calculating the fat-shattering dimension. We emphasise that with the Bloch-sphere representation, all the quantum
measurements/states are transformed into Euclidean space, where existing ML algorithms (e.g. perceptron learning
algorithm, neural network, SVM, etc.) can be applied to conduct the learning tasks. It is also worth considering
other metrics (e.g. Bures metric, or other $\ell_p$ norms in the Bloch-sphere representation) and parameterisation methods
(e.g. Weyl operator basis, polarisation operator basis, Majorana representation, etc.) in our quantum ML framework.
We leave this as future work.

When learning an $(M+1)$-outcome POVM measurement $\{\Pi_j\}_{j=0}^{M}$ with $\sum_{j=0}^{M} \Pi_j = I$, we can simply follow the
procedure discussed so far. Now the training data set consists of $\{(\rho_i, \mathrm{Tr}(\Pi\rho_i))\}_{i=1}^n$, where

\[ \mathrm{Tr}(\Pi\rho_i) := (\mathrm{Tr}(\Pi_1 \rho_i), \ldots, \mathrm{Tr}(\Pi_M \rho_i)). \]

This is called multi-target prediction or multi-label classification. Each target can be independently learned by
the individual function class $\mathcal{F}$.

It is worth mentioning that Gross and Flammia et al. [44, 45] proposed a quantum state tomography method
via compressed sensing, which is similar to our setting of learning quantum states. The main goal of that work is
to concentrate on states $\rho$ that can be well approximated by density matrices of rank $r \ll d$ and to reconstruct
a density matrix $\hat\rho$ based on $m$ randomly sampled Pauli operators. With certain constraint coefficients $\lambda$ and
$m \ge C r d \log^6 d$, they show

\[ \|\hat\rho - \rho\|_1 \le C_0 r \lambda + C_1 \|\rho_c\|_1, \]

where $\rho_c = \rho - \rho_r$ is the residual part and $\rho_r$ is the best rank-$r$ approximation to $\rho$.

7. Conclusions 

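Table 1. The Complexity Measures of the Quantum Learning Problems.

Complexity measure                                               Learning Quantum Measurements              Learning Quantum States
Pseudo Dimension                                                 $d^2$                                      $d^2 - 1$
Fat-Shattering Dimension $\mathrm{fat}_{\mathcal{F}}(\epsilon)$                   $d/\epsilon^2$                             $\log d/\epsilon^2$
Uniform Entropy Number $\log \mathcal{N}_2(\epsilon, \mathcal{F})$                $d/\epsilon^2$                             $\log d/\epsilon^2$
Rademacher Complexity $R_n(\mathcal{F})$                                  $\sqrt{d}$                                 $\sqrt{\log d}$
Sample Complexity $m_{\mathcal{F}}(\epsilon, \delta)$                     $\max\{d, \log(1/\delta)\}/\epsilon^2$     $\max\{\log d, \log(1/\delta)\}/\epsilon^2$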


In this paper, we formalised the problems of learning quantum measurements and quantum states and analysed
their learnability. We solved the sample complexity problems for learning quantum measurements and quantum
states. In the scenario of learning (two-outcome) quantum measurements, the fat-shattering dimension is
$\min\{O(d/\epsilon^2), d^2\}$. We also showed that the fat-shattering dimension for its dual problem, learning quantum
states, is $\min\{O(\log d/\epsilon^2), d^2 - 1\}$. Our proof is entirely based on tools from classical learning theory, and
provides an alternative proof for Aaronson's result [39]. We also derived other important complexity measures for
these two tasks, and the results are summarised in Table 1. Our results demonstrate that learning an unknown
measurement is a more daunting task than learning an unknown quantum state. The intuition is that, since the
effect space is much larger than the state space, it is reasonable that the fat-shattering dimension of the effect space
is larger, too.
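The gap between the two tasks can be made tangible by evaluating the sample complexity entries of Table 1 for a few system sizes. The sketch below is illustrative only: the absolute constant is unknown and is set to `C = 1.0` as an assumption, so the printed numbers indicate scaling rather than actual sample sizes. The function names are hypothetical.

```python
import math

def sample_complexity_measurement(d, eps, delta, C=1.0):
    """C * max{d, log(1/delta)} / eps^2, the measurement entry of Table 1."""
    return C * max(d, math.log(1 / delta)) / eps ** 2

def sample_complexity_state(d, eps, delta, C=1.0):
    """C * max{log d, log(1/delta)} / eps^2, the state entry of Table 1."""
    return C * max(math.log(d), math.log(1 / delta)) / eps ** 2

eps, delta = 0.1, 0.05
for m in (2, 5, 10):                 # number of qubits, so d = 2^m
    d = 2 ** m
    meas = sample_complexity_measurement(d, eps, delta)
    state = sample_complexity_state(d, eps, delta)
    print(f"m = {m:2d}, d = {d:4d}: measurements ~ {meas:.0f}, states ~ {state:.0f}")
```

For $m$ qubits the measurement bound grows like $2^m$ while the state bound grows like $m$, which is the exponential separation discussed above.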

Finally, by exploiting the general Bloch-sphere representation, we showed that our learning problems are equivalent
to a neural network, so that classical ML algorithms can be applied to learn the unknown quantum measurement
or state. Our work could provide a new viewpoint on the study of quantum state and measurement tomography.
We also discussed connections between the quantum learning problems and other fields in QIP, such as the existence of
QRA codes and quantum state discrimination. We hope that our results will stimulate more
theoretical studies in quantum statistical learning, and that more applications in quantum information processing and
related areas can be discovered.


Appendix A. Notation Table 

Appendix B. Sample Complexity in Terms of Complexity Measure 

In Section 2.2, we introduced several complexity measures. In this section, we list some well-known deviation
formulas that express the generalisation error and sample complexity in terms of those complexity measures.

It has been established that any set of Boolean functions is a uGC class (i.e. PAC learnable) if and only if it 
has a finite VC dimension [95, 96]. Additionally, the finite VC dimension provides an upper bound for the sample 
complexity of the Boolean function class. 

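Theorem B.1 (Vapnik et al. [67, 95, 96]). Let $C$ be an absolute constant and $\mathcal{F}$ be a class of Boolean functions
which has a finite VC dimension $d$. Then, for every $0 < \epsilon, \delta < 1$,

\[ \sup_{\mu} \Pr\Big\{ \sup_{f \in \mathcal{F}} |L(f) - L_n(f)| \ge \epsilon \Big\} \le \delta, \]

provided that $n \ge \frac{C}{\epsilon^2}\big(d \log(2/\epsilon) + \log(2/\delta)\big)$.

Therefore, the sample complexity is bounded by

(B.1)    $m_{\mathcal{F}}(\epsilon, \delta) \le \frac{C}{\epsilon^2} \max\Big\{ d \log\frac{2}{\epsilon},\ \log\frac{2}{\delta} \Big\}.$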

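Following the same reasoning as in Theorem B.1, analogous results can be drawn: the hypothesis set $\mathcal{F}$ is a
uGC class if and only if it has a finite fat-shattering dimension for every $\epsilon > 0$ [79, 97, 98]. We have the following
theorem:

Theorem B.2 (Bartlett et al. [79, 97, 98]). There is an absolute constant $C$ such that for every $\mathcal{F}$ consisting of
bounded functions and every $0 < \epsilon, \delta < 1$,

\[ \sup_{\mu} \Pr\Big\{ \sup_{f \in \mathcal{F}} |L(f) - L_n(f)| \ge \epsilon \Big\} \le \delta, \]

provided that $n \ge \frac{C}{\epsilon^2}\big(\mathrm{fat}_{\mathcal{F}}(\epsilon/8) \cdot \log(2/\epsilon) + \log(8/\delta)\big)$.
Therefore, the sample complexity is bounded by

(B.2)    $m_{\mathcal{F}}(\epsilon, \delta) \le \frac{C}{\epsilon^2} \max\Big\{ \mathrm{fat}_{\mathcal{F}}(\epsilon/8) \cdot \log\frac{2}{\epsilon},\ \log\frac{8}{\delta} \Big\}.$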
The entropy number is distribution-independent and is closely related to the learnability of the function class.
Dudley et al. [99] showed that a class $\mathcal{F}$ consisting of bounded functions is a uGC class if and only if there is
some $1 \le p \le \infty$ such that for every $\epsilon > 0$,

\[ \lim_{n \to \infty} \frac{\log \mathcal{N}_p(\epsilon, \mathcal{F}, n)}{n} = 0. \]


In addition, we have the following theorem: 


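Theorem B.3 (Pollard [73]). Let $\mathcal{F}$ be a set of bounded functions.

(i) For every $0 < \epsilon < 1$, any $n \ge 8/\epsilon^2$, and any probability measure $\mu$,

\[ \Pr\Big\{ \sup_{f \in \mathcal{F}} |L(f) - L_n(f)| > \epsilon \Big\} \le 8\, \mathcal{N}_1(\epsilon/8, \mathcal{F}, n) \exp\Big(-\frac{n\epsilon^2}{128}\Big). \]

(ii) For every $0 < \epsilon, \delta < 1$,

\[ \sup_{\mu} \Pr\Big\{ \sup_{f \in \mathcal{F}} |L(f) - L_n(f)| \ge \epsilon \Big\} \le \delta, \]

provided that $n \ge \frac{C}{\epsilon^2}\big(\log \mathcal{N}_1(\epsilon, \mathcal{F}) + \log(2/\delta)\big)$.

Therefore, the sample complexity is bounded by

(B.3)    $m_{\mathcal{F}}(\epsilon, \delta) \le \frac{C}{\epsilon^2} \max\Big\{ \log \mathcal{N}_1(\epsilon, \mathcal{F}),\ \log\frac{2}{\delta} \Big\}.$

Table 2. Summary of Notation

Notation                                   Mathematical Meaning
$\mathcal{H}$                              the (separable) Hilbert space
$d$                                        the dimension of the linear space
$\mathbb{R}, \mathbb{N}$                   the set of real numbers and positive integers
$\mathbb{C}^d$                             the linear space of $d$-dimensional complex vectors
$\mathbb{M}_d$                             the set of all self-adjoint operators on $\mathbb{C}^d$
$\mathrm{Tr}$                              the trace function on $\mathbb{M}_d$
$A^\dagger$                                the conjugate transpose of $A$
$\langle A, B \rangle$                     $= \mathrm{Tr}(B^\dagger A)$, the Hilbert-Schmidt inner product on $\mathbb{M}_d$; also stands for the conventional inner product on $\mathbb{C}^d$
$\mathcal{B}(\mathcal{H})$                 the set of bounded operators on $\mathcal{H}$
$\mathcal{T}(\mathcal{H})$                 the set of trace class operators (i.e. finite trace) on $\mathcal{H}$
$O$                                        the zero operator on $\mathcal{H}$
$I$                                        the identity operator on $\mathcal{H}$
$A \succeq B$                              $\equiv A - B \succeq O$, the standard partial ordering
$\|M\|_p$                                  the Schatten $p$-norm on $\mathbb{M}_d$, which reduces to the $\ell_p$ norms on $\mathbb{C}^d$
$S_p^d$                                    $= \{M \in \mathbb{M}_d : \|M\|_p \le 1\}$, the unit ball of the Schatten $p$-class
$|\varphi\rangle$                          the unit vector on $\mathcal{H}$
$\rho, \sigma$                             the quantum state on $\mathcal{H}$, i.e. $\rho = \rho^\dagger \in \mathcal{T}(\mathcal{H})$, with $\mathrm{Tr}(\rho) = 1$
$E, \Pi$                                   the POVM element on $\mathcal{H}$, i.e. $O \preceq E \preceq I$
$\mathcal{Q}(\mathcal{H})$                 state space, the set of all states on $\mathcal{H}$
$\mathcal{E}(\mathcal{H})$                 effect space, the set of all POVM elements on $\mathcal{H}$
$\mathcal{X}$                              the input space, also called the instances domain (the set)
$\mathcal{Y}$                              the output space, also called the labels domain (the set)
$\mathcal{Z}$                              $= \mathcal{X} \times \mathcal{Y}$
$\mathcal{F}$                              the hypothesis set of functions $f : \mathcal{X} \to \mathcal{Y}$
$\mu$                                      a distribution on $\mathcal{Z}$
$Z_n$                                      a training data set of $n$ elements drawn independently according to $\mu$
$\ell_f : \mathcal{Z} \to [0, \infty)$     loss function
$\Pr, \mathbb{E}$                          probability and expectation of a random variable
$L(f)$                                     $= \mathbb{E}_\mu[\ell_f(X, Y)]$, the ensemble error
$L_n(f)$                                   $= \frac{1}{n} \sum_{i=1}^n \ell_f(X_i, Y_i)$, the empirical error over the training data set $Z_n$
$\mathrm{VCdim}(\mathcal{F})$              Vapnik-Chervonenkis dimension of the function class $\mathcal{F}$
$\mathrm{Pdim}(\mathcal{F})$               pseudo dimension of the function class $\mathcal{F}$
$\mathrm{fat}_{\mathcal{F}}(\epsilon)$     fat-shattering dimension of the function class $\mathcal{F}$ with $\epsilon > 0$
$\underline{\mathrm{fat}}_{\mathcal{F}}(\epsilon)$   level fat-shattering dimension of the function class $\mathcal{F}$ with $\epsilon > 0$
$\mathcal{N}(\epsilon, \mathcal{F}, \tau)$ covering number of $\mathcal{F}$ with metric $\tau$ and $\epsilon > 0$
$\log \mathcal{N}(\epsilon, \mathcal{F}, \tau)$   entropy number
$R_n(\mathcal{F})$                         Rademacher complexity of the function class $\mathcal{F}$ on $Z_n$
$\gamma_i$                                 uniformly $\{+1, -1\}$-valued random variables, also called Rademacher variables
$O$                                        the big O notation; $f = O(g)$ means $f(x) \le c\, g(x)$ for some positive $c$, $x_0$ and all $x \ge x_0$
$A \lesssim B$                             $\equiv A \le cB$ for some constant $c$
$A \simeq B$                               both $A \lesssim B$ and $A \gtrsim B$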


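Theorem B.4 (Bartlett and Mendelson [67]). There is an absolute constant $C$ such that for any $0 < \epsilon, \delta < 1$ we have

\[ \Pr\Big\{ \sup_{f \in \mathcal{F}} |L(f) - L_n(f)| \ge \epsilon \Big\} \le \delta, \]

provided that $n \ge \frac{C}{\epsilon^2} \max\{R_n^2(\mathcal{F}), \log(1/\delta)\}$.

Therefore, the sample complexity is bounded by

(B.4)    $m_{\mathcal{F}}(\epsilon, \delta) \le \frac{C}{\epsilon^2} \max\Big\{ R_n^2(\mathcal{F}),\ \log\frac{1}{\delta} \Big\}.$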


Appendix C. Learning Framework in Bloch-sphere Representation 


When illustrating the state space on a finite dimensional Hilbert space $\mathbb{C}^d$, it is convenient to adopt a geometric
parameterisation method called the Bloch-sphere representation [100-102]. Here, we provide another point of view on
our quantum learning framework. The key idea is to represent the quantum objects in a Euclidean space, wherein
classical techniques of traditional ML can be applied. Although the Bloch-sphere representation may not be
as direct as the machinery we used in Sections 4 and 5, it does offer more insight into our quantum ML problems.

Based on the orthogonal basis $\{I, \Lambda_1, \ldots, \Lambda_{d^2-1}\}$, where the $\Lambda_j$ are the generators of SU($d$), any state $\rho_r$ on $\mathbb{C}^d$ can be represented by a Bloch
vector $r$ through:

(C.1)    $\rho_r = \frac{1}{d}\Big( I + c_d \sum_{j=1}^{d^2-1} r_j \Lambda_j \Big) = \frac{1}{d}\big( I + c_d\, r \cdot \Lambda \big),$

where $c_d := \sqrt{d(d-1)/2}$, the dot product corresponds to the conventional Euclidean inner product, and

\[ r_i = \sqrt{\frac{d}{2(d-1)}}\, \mathrm{Tr}(\rho_r \Lambda_i) \in \mathbb{R}, \quad i = 1, \ldots, d^2 - 1. \]

Define the Bloch vector space as the set of Bloch vectors that represent valid states on $\mathbb{C}^d$:

\[ \Omega_d := \Big\{ r \in \mathbb{R}^{d^2-1} : r = \sqrt{\tfrac{d}{2(d-1)}}\, \mathrm{Tr}(\rho_r \Lambda) \Big\}. \]

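Now we calculate the linear functional of $P_n \in \mathcal{E}_1$ acting on the state $\rho_r$ (where $\mathcal{E}_k$ denotes the convex hull of
rank-$k$ projection operators):

\begin{align*}
\mathrm{Tr}(P_n \rho_r) &= \mathrm{Tr}\Big[ \frac{1}{d}\big(I + c_d\, r \cdot \Lambda\big) \frac{1}{d}\big(I + c_d\, n \cdot \Lambda\big) \Big] \\
&= \frac{1}{d^2} \mathrm{Tr}\Big[ I + c_d (r \cdot \Lambda + n \cdot \Lambda) + c_d^2 (r \cdot \Lambda)(n \cdot \Lambda) \Big] \\
&= \frac{1}{d} + \frac{d-1}{2d} \mathrm{Tr}\big( (r \cdot \Lambda)(n \cdot \Lambda) \big) \\
&= \frac{1}{d}\big(1 + (d-1)\, r \cdot n\big).
\end{align*}

Consequently, we have the affine functionals with elements in the convex hull of rank-1 projection operators, i.e.

\[ \mathcal{F}_1 = \Big\{ \rho_r \mapsto \frac{1}{d}\big(1 + (d-1)\, r \cdot n\big) : n \in \Omega_d \Big\}. \]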
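For a qubit ($d = 2$) the basis operators can be taken to be the Pauli matrices and $c_d = 1$, so the identity $\mathrm{Tr}(P_n \rho_r) = \tfrac12(1 + r \cdot n)$ can be checked by direct matrix multiplication. The sketch below does this with plain nested lists; the helper names and the sample vectors are assumptions made for illustration.

```python
I2 = [[1, 0], [0, 1]]
PAULI = [
    [[0, 1], [1, 0]],        # sigma_x
    [[0, -1j], [1j, 0]],     # sigma_y
    [[1, 0], [0, -1]],       # sigma_z
]

def bloch_to_operator(r):
    """(1/2)(I + r . sigma): Eq. (C.1) specialised to d = 2, where c_d = 1."""
    return [[(I2[a][b] + sum(r[k] * PAULI[k][a][b] for k in range(3))) / 2
             for b in range(2)] for a in range(2)]

def trace_product(A, B):
    """Tr(A B) for 2 x 2 matrices stored as nested lists."""
    return sum(A[a][b] * B[b][a] for a in range(2) for b in range(2))

r = (0.3, -0.2, 0.5)     # Bloch vector of a (mixed) qubit state
n = (0.0, 0.6, 0.8)      # unit Bloch vector of a rank-1 projection

lhs = trace_product(bloch_to_operator(n), bloch_to_operator(r)).real
rhs = 0.5 * (1 + sum(a * b for a, b in zip(r, n)))
print(lhs, rhs)
```

Both sides agree up to floating-point rounding, which is the $d = 2$ instance of the last line of the derivation above.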

In order to characterise the quantum effects associated with higher-rank projection operators, it is useful
to consider the algebraic properties of the projection operators. The set of projection operators on $\mathbb{C}^d$ is not
a vector space but corresponds to an orthocomplemented lattice. Therefore, the sum of two projections, say
$P$ and $Q$, is a projection only when they are orthogonal, i.e. $PQ = QP = O$. Based on this fact, now let
$\{P_{n_1}, \ldots, P_{n_d}\}$ be arbitrary mutually orthogonal rank-one projections on $\mathbb{C}^d$. To each of them, we associate a unit
Bloch vector $n_i$ such that $P_{n_i} = \frac{1}{d}(I + c_d\, n_i \cdot \Lambda)$, $i = 1, \ldots, d$. It can be verified by Eq. (C.1) that the Bloch
vectors $\{n_1, \ldots, n_d\}$ form a $(d-1)$-dimensional (regular) simplex since the angle between any two Bloch vectors is
$\theta(n_i, n_j) = \cos^{-1}\big(-\frac{1}{d-1}\big)$. With a slight abuse of notation, denote a rank-$k$ projection $P_{n^{(k)}}$ as the summation of
arbitrary $k$ different projections from the set $\{P_{n_1}, \ldots, P_{n_d}\}$. More formally, we denote an index set $I_k \subseteq \{1, \ldots, d\}$
with cardinality $k$, and $P_{n^{(k)}} = \sum_{i \in I_k} P_{n_i}$, where we adopt the convention that the empty sum is zero. Hence,
when a rank-$k$ projection $P_{n^{(k)}} \in \mathcal{F}_k$ acts on the state $\rho_r$, we have:

\[ \mathrm{Tr}(P_{n^{(k)}} \rho_r) = \sum_{i \in I_k} \frac{1}{d}\big(1 + (d-1)\, r \cdot n_i\big) \]

(C.2)    $\qquad = k \cdot \frac{1}{d}\big(1 + (d-1)\, r \cdot n^{(k)}\big),$

where

(C.3)    $n^{(k)} := \frac{1}{k} \sum_{i \in I_k} n_i$

is the centroid of the $(k-1)$-face of the simplex $\Delta_{d-1}$ subtended by the vectors $\{n_i\}_{i \in I_k}$. The $\ell_2$-norm of $n^{(k)}$ can
be calculated as the Euclidean distance from the center of the simplex $\Delta_{d-1}$ to the centroid of the $(k-1)$-face; that is

(C.4)    $\|n^{(k)}\|_2 =: r_{d,k} = \sqrt{\frac{d-k}{k(d-1)}} \le 1, \quad k \in \{1, 2, \ldots, d\}.$

Intuitively, we can interpret the value $\mathrm{Tr}(P_{n^{(k)}} \rho_r)$ as an operator $P_n$ acting on the state $\rho_r$, and then scaled by $k$.
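The norm in Eq. (C.4) follows from the Gram matrix of the simplex vertices alone: the vectors are unit vectors with pairwise inner product $-1/(d-1)$. The sketch below compares the norm computed from these inner products with the closed form $\sqrt{(d-k)/(k(d-1))}$; the function names are hypothetical.

```python
import math

def centroid_norm_from_gram(d, k):
    """Norm of the mean of k unit vectors with pairwise inner product -1/(d-1)."""
    off_diagonal = -1.0 / (d - 1)
    # Sum of n_i . n_j over all pairs (i, j): k diagonal terms plus k(k-1) others.
    gram_sum = k * 1.0 + k * (k - 1) * off_diagonal
    return math.sqrt(max(gram_sum, 0.0)) / k

def r_dk(d, k):
    """Closed form sqrt((d - k) / (k (d - 1))) from Eq. (C.4)."""
    return math.sqrt((d - k) / (k * (d - 1)))

for d in (2, 3, 4, 8):
    for k in range(1, d + 1):
        print(d, k, round(centroid_norm_from_gram(d, k), 6), round(r_dk(d, k), 6))
```

The norm equals 1 for $k = 1$ (a vertex) and 0 for $k = d$ (the centre of the simplex), matching the two extreme cases of Eq. (C.4).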

Since every quantum effect can be decomposed into the extremal effects (i.e. projection operators) of the effect
space [103], we can represent $\mathrm{Tr}(E_n \rho_r)$ for all $E_n \in \mathcal{E}(\mathbb{C}^d)$ as:

(C.5)    $\sum_{k=0}^{d} w_k \frac{k}{d}\big(1 + (d-1)\, r \cdot n^{(k)}\big) =: \frac{1}{d}\big(n_0 + (d-1)\, r \cdot n\big),$

where $\sum_{k=0}^{d} w_k = 1$, $0 \le n_0 \le d$ and $\|n\|_2 \le \max_{k \in \{0, 1, \ldots, d\}} \sqrt{\frac{k(d-k)}{d-1}}$.

By utilising the bijection between a quantum state $\rho_r$ and its corresponding Bloch vector $r$, we can take
the input space to be the Bloch vector space, i.e. $\mathcal{X} = \Omega_d$. Denote the function class $\mathcal{F}_k$ as the linear functionals of $\mathcal{E}_k$
acting on $\rho_r$. According to Eq. (C.2), we have:

(C.6)    $\mathcal{F}_k = \mathrm{conv}\Big( \Big\{ r \mapsto \frac{k}{d}\big(1 + (d-1)\, r \cdot n^{(k)}\big) \Big\} \Big).$

For the rank-0 projection operator, the class consists of only one element, i.e. $\mathcal{F}_0 = \{O\}$. We can see from the above
equation that the affine coefficient is fixed such that $\mathcal{F}_k$ consists of linear functionals. For the class of all quantum
effects $\mathcal{F} = \mathcal{E}(\mathbb{C}^d)$, by Eq. (C.5) we have a similar result:

\[ \mathcal{F} = \Big\{ r \mapsto \frac{1}{d}\big(n_0 + (d-1)\, r \cdot n\big) : n \in \mathbb{R}^{d^2-1} \Big\}, \quad r \in \Omega_d, \]

where $n_0$ can be upper bounded by $d$ and $\|n\|_2$ can be bounded by $k \cdot r_{d,k} = \sqrt{\frac{k(d-k)}{d-1}}$. Clearly, $\mathcal{F} = \mathcal{E}(\mathcal{H})$ is the
function class consisting of the affine functionals. However, we can easily convert this formulation into a linear
form by letting $\tilde{r} = [1, r]$ and $\tilde{n} = [n_0, n]$. The intuition behind this is that when characterising the learnability of
quantum measurements, all we need is to bound the complexity measures of the class of linear functionals.

Appendix D. Neural Networks 

Here we briefly introduce the theory of neural networks. Readers may refer to Ref. [61] for more details. The
basic computing unit in a neural network is the (simple) perceptron (see Fig. 2), which computes a function from
$\mathbb{R}^d$ to $\mathbb{R}$:

\[ f(r) = \sigma(v \cdot r + v_0), \]

for input vector $r \in \mathbb{R}^d$, where $v = (v_1, \ldots, v_d) \in \mathbb{R}^d$ and $v_0 \in \mathbb{R}$ are adjustable parameters, or weights (the
particular weight $v_0$ being known as the threshold). The function $\sigma : \mathbb{R} \to \mathbb{R}$ is called the activation function. In the
scenario of binary classification, the activation function may be chosen as the sign function; in the case of real-valued
outputs, $\sigma(\cdot)$ may satisfy some Lipschitz conditions. Note that the decision boundary of the binary perceptron is
the affine subspace of $\mathbb{R}^d$ defined by the equation $v \cdot r + v_0 = 0$.

When using a simple perceptron for a binary classification problem, the perceptron learning algorithm (PCA)
finds parameters $v$ and $v_0$ that fit the training data set well. The algorithm starts from an arbitrary initial
parameter and updates the parameter whenever a training example is misclassified. For example, if the current function $f$ misclassifies a labelled example
$(r, y)$ (with $r \in \mathbb{R}^d$ and $y \in \{0, 1\}$), the algorithm adds $\eta(y - f(r))[r, 1]$ element-wise to $[v, v_0]$, where $\eta$ is a fixed
step constant. PCA iterates until a termination criterion is reached.
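A minimal sketch of the perceptron learning algorithm described above is given below. It assumes a threshold activation $\sigma(t) = 1$ for $t \ge 0$ and $0$ otherwise, uses the convention $f(r) = \sigma(v \cdot r + v_0)$ so that the bias is updated by $+\eta(y - f(r))$, and stops after an epoch without mistakes. The training set of labelled Bloch vectors and all function names are hypothetical.

```python
def step(t):
    """Threshold activation: 1 if t >= 0, else 0."""
    return 1 if t >= 0 else 0

def predict(v, v0, r):
    """Simple perceptron f(r) = step(v . r + v0)."""
    return step(sum(vi * ri for vi, ri in zip(v, r)) + v0)

def train_perceptron(data, eta=0.1, max_epochs=1000):
    """Perceptron learning: update [v, v0] only on misclassified examples."""
    dim = len(data[0][0])
    v, v0 = [0.0] * dim, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for r, y in data:
            err = y - predict(v, v0, r)
            if err != 0:
                v = [vi + eta * err * ri for vi, ri in zip(v, r)]
                v0 += eta * err
                mistakes += 1
        if mistakes == 0:      # termination criterion: a clean pass over the data
            break
    return v, v0

# Hypothetical training set: qubit Bloch vectors, labelled 1 when the
# z-component is positive and 0 otherwise (linearly separable).
data = [
    ((0.0, 0.0, 1.0), 1), ((0.6, 0.0, 0.8), 1), ((0.0, 0.6, 0.8), 1),
    ((0.0, 0.0, -1.0), 0), ((0.6, 0.0, -0.8), 0), ((0.0, -0.6, -0.8), 0),
]
v, v0 = train_perceptron(data)
print(v, v0)
print([predict(v, v0, r) for r, _ in data])
```

Because the data are linearly separable, the loop terminates with every example classified correctly; for non-separable data the `max_epochs` cap acts as the termination criterion.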

The second example is the two-layer network (also called a single-hidden layer net) (see Fig. 3). The network
can compute a function of the form

\[ f(r) = \sum_{i=1}^{k} w_i\, \sigma(v_i \cdot r + v_{0i}) + w_0, \]

where $w_i \in \mathbb{R}$, $i = 0, \ldots, k$, are the output weights and $[v_i, v_{0i}]$ are the input weights. The positive integer $k$ is the
number of hidden units. One can use a 'gradient descent' procedure to adjust the parameters to minimise the
squared errors over the training data.
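The connection to quantum measurements can be sketched for a qubit, where the hidden units correspond to the rank-0, rank-1 and rank-2 classes and the output weights are the mixture weights $w_k$. The example below assumes an identity activation, takes $n^{(2)} = 0$ (the centre of the simplex), and uses hypothetical weights and helper names; it evaluates the effect $E = 0.5\,|0\rangle\langle 0| + 0.3\, I$ on two basis states.

```python
def hidden_unit(k, r, n_k, d=2):
    """Rank-k unit (k/d)(1 + (d - 1) r . n_k) with identity activation."""
    dot = sum(a * b for a, b in zip(r, n_k))
    return (k / d) * (1 + (d - 1) * dot)

def two_layer_effect(r, weights, directions, d=2):
    """Output sum_k w_k * hidden_unit(k, r, n^(k)) of the single-hidden layer net."""
    return sum(w * hidden_unit(k, r, directions[k], d)
               for k, w in enumerate(weights))

weights = (0.2, 0.5, 0.3)          # w_0, w_1, w_2 (mixture weights, summing to one)
directions = (
    (0.0, 0.0, 0.0),               # n^(0): unused, the rank-0 unit is identically zero
    (0.0, 0.0, 1.0),               # n^(1): Bloch vector of |0><0|
    (0.0, 0.0, 0.0),               # n^(2): centroid of both vertices, i.e. the origin
)

up = (0.0, 0.0, 1.0)               # Bloch vector of the state |0><0|
down = (0.0, 0.0, -1.0)            # Bloch vector of the state |1><1|
print(two_layer_effect(up, weights, directions))
print(two_layer_effect(down, weights, directions))
```

The two outputs are $\mathrm{Tr}(E\,|0\rangle\langle 0|)$ and $\mathrm{Tr}(E\,|1\rangle\langle 1|)$ for the effect above, so learning the weights of this small network amounts to learning the measurement.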


Figure 2. Consider a qubit system. A measurement in $\mathcal{F}_1$ can be characterised by a simple
perceptron with 3-dimensional input data and the activation function $\sigma$. The '1' node is a bias
node and $v_0$ is the corresponding bias weight. The input vector is the Bloch vector $r \in \Omega_2$. The
output variable $y = f(r)$ is computed by the simple perceptron. Hence the problem of learning an
unknown measurement $\Pi \in \mathcal{F}_1$ is to infer the simple perceptron, i.e. the values of $\{v_i\}_{i=1}^{3}$.


Figure 3. A single-hidden layer net computes on 3-dimensional input data with activation function $\sigma$
and three hidden units, which correspond to $\mathcal{F}_k$ for $k = 0, 1, 2$. The value $v_{0k}$ corresponds to the
bias weight of the $k$-th hidden unit. The single-hidden layer net represents a quantum measurement in
$\mathcal{E}(\mathbb{C}^2)$.